# Mandatory Challenge
## Context
You work on the data analysis team of a very important company. On Monday, the company shares some good news with you: it has just been hired by a major retail company! So get ready for a huge amount of work!

Then you get to work with your team and define the following tasks to perform:   
1. You need to start your analysis using data from the past.  
2. You need to define a process that takes your daily data as an input and integrates it.  

You are in charge of the second part, so you are provided with a sample file that you will have to read daily. To complete your task, you need the following aggregates:
* One aggregate per store that adds up the rest of the values.
* One aggregate per item that adds up the rest of the values.

You can import the dataset `retail_sales` from Ironhack's database. 

## Your task
Therefore, your process will consist of the following steps:
1. Read the sample file that a daily process will save in your folder. 
2. Clean up the data.
3. Create the aggregates.
4. Write three tables in your local database: 
    - A table for the cleaned data.
    - A table for the aggregate per store.
    - A table for the aggregate per item.

## Instructions
* Read the CSV data that you can find in Ironhack's database.
* Clean the data and create the aggregates as you see fit.
* Create the tables in your local database.
* Populate them with your process.

In [1]:
import pandas as pd

from sqlalchemy import create_engine

#### Importing database through sqlalchemy

In [2]:


driver   = 'mysql+pymysql:'
user     = 'data-students'
password = 'iR0nH@cK-D4T4B4S3'
ip       = '34.65.10.136'
database = 'retail_sales'

In [3]:
connection_string = f'{driver}//{user}:{password}@{ip}/{database}'
print(connection_string)

mysql+pymysql://data-students:iR0nH@cK-D4T4B4S3@34.65.10.136/retail_sales
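
The password above contains an `@`, which is also the character that separates the credentials from the host in a database URL. Whether an unescaped `@` is parsed correctly may depend on the SQLAlchemy version, so a safer approach is to percent-encode the credentials before building the string. The following is a minimal sketch using only the standard library; the helper name `build_connection_string` and the credentials are placeholders, not part of the original notebook.

```python
from urllib.parse import quote_plus


def build_connection_string(driver, user, password, host, database):
    """Return a database URL with the user and password percent-encoded."""
    return f"{driver}://{quote_plus(user)}:{quote_plus(password)}@{host}/{database}"


# Placeholder values for illustration only (not the course credentials).
example_url = build_connection_string(
    "mysql+pymysql", "some-user", "p@ss/word", "127.0.0.1", "retail_sales"
)
print(example_url)
# mysql+pymysql://some-user:p%40ss%2Fword@127.0.0.1/retail_sales
```

The resulting string can be passed to `create_engine` in the same way as `connection_string`.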


In [4]:
engine = create_engine(connection_string)
print(engine)

Engine(mysql+pymysql://data-students:***@34.65.10.136/retail_sales)


In [5]:
retail_db = pd.read_sql('SHOW TABLES', engine)
retail_db.head()

|  | Tables_in_retail_sales |
|---|---|
| 0 | raw_sales |


In [6]:
raw_sales = pd.read_sql('SELECT * FROM raw_sales', engine)

In [7]:
raw_sales

|  | date | shop_id | item_id | item_price | item_cnt_day |
|---|---|---|---|---|---|
| 0 | 2015-01-04 | 29 | 1469 | 1199.0 | 1.0 |
| 1 | 2015-01-04 | 28 | 21364 | 479.0 | 1.0 |
| 2 | 2015-01-04 | 28 | 21365 | 999.0 | 2.0 |
| 3 | 2015-01-04 | 28 | 22104 | 249.0 | 2.0 |
| 4 | 2015-01-04 | 28 | 22091 | 179.0 | 1.0 |
| ... | ... | ... | ... | ... | ... |
| 4540 | 2015-01-04 | 15 | 4240 | 1299.0 | 1.0 |
| 4541 | 2015-01-04 | 14 | 21922 | 99.0 | 1.0 |
| 4542 | 2015-01-04 | 15 | 1969 | 3999.0 | 1.0 |
| 4543 | 2015-01-04 | 14 | 22091 | 179.0 | 1.0 |


In [8]:
# Count missing (NaN) values per column before cleaning
raw_sales.isna().sum()

date            0
shop_id         0
item_id         0
item_price      0
item_cnt_day    0
dtype: int64
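
`isna()` only reports missing values, so a frame with no NaNs can still contain rows that need cleaning. The sketch below shows a few further checks a daily process might apply. The rules themselves (dropping exact duplicate rows and rows with non-positive prices or quantities) and the helper name `clean_sales` are assumptions for illustration; whether negative quantities are returns that should be kept is a business decision the notebook does not address. The sample frame is made up and only reuses the column names of `raw_sales`.

```python
import pandas as pd


def clean_sales(df):
    """Return a cleaned copy of a raw_sales-like frame (the rules are assumptions)."""
    cleaned = df.copy()
    # Parse dates so later steps can rely on a datetime column.
    cleaned["date"] = pd.to_datetime(cleaned["date"])
    # Drop rows with missing values and exact duplicate rows.
    cleaned = cleaned.dropna().drop_duplicates()
    # Keep only rows with a positive price and a positive quantity.
    valid = (cleaned["item_price"] > 0) & (cleaned["item_cnt_day"] > 0)
    return cleaned[valid].reset_index(drop=True)


# Made-up sample: one duplicate row and one row with a negative price.
sample = pd.DataFrame({
    "date": ["2015-01-04"] * 4,
    "shop_id": [29, 28, 28, 28],
    "item_id": [1469, 21364, 21364, 21365],
    "item_price": [1199.0, 479.0, 479.0, -1.0],
    "item_cnt_day": [1.0, 1.0, 1.0, 2.0],
})
cleaned_sample = clean_sales(sample)
print(cleaned_sample)
```

On this sample, two of the four rows survive: the duplicate and the negative-price row are removed.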

In [9]:
# Keep an untouched copy of the raw data and work on a second copy
original = raw_sales.copy()

raw_sales_data = raw_sales.copy()

In [10]:
original

|  | date | shop_id | item_id | item_price | item_cnt_day |
|---|---|---|---|---|---|
| 0 | 2015-01-04 | 29 | 1469 | 1199.0 | 1.0 |
| 1 | 2015-01-04 | 28 | 21364 | 479.0 | 1.0 |
| 2 | 2015-01-04 | 28 | 21365 | 999.0 | 2.0 |
| 3 | 2015-01-04 | 28 | 22104 | 249.0 | 2.0 |
| 4 | 2015-01-04 | 28 | 22091 | 179.0 | 1.0 |
| ... | ... | ... | ... | ... | ... |
| 4540 | 2015-01-04 | 15 | 4240 | 1299.0 | 1.0 |
| 4541 | 2015-01-04 | 14 | 21922 | 99.0 | 1.0 |
| 4542 | 2015-01-04 | 15 | 1969 | 3999.0 | 1.0 |
| 4543 | 2015-01-04 | 14 | 22091 | 179.0 | 1.0 |


#### Creating a new `total_revenue` column (price × quantity) to measure sales revenue

In [11]:
raw_sales_data['total_revenue'] = raw_sales_data['item_price'] * raw_sales_data['item_cnt_day']

In [12]:
raw_sales_data

|  | date | shop_id | item_id | item_price | item_cnt_day | total_revenue |
|---|---|---|---|---|---|---|
| 0 | 2015-01-04 | 29 | 1469 | 1199.0 | 1.0 | 1199.0 |
| 1 | 2015-01-04 | 28 | 21364 | 479.0 | 1.0 | 479.0 |
| 2 | 2015-01-04 | 28 | 21365 | 999.0 | 2.0 | 1998.0 |
| 3 | 2015-01-04 | 28 | 22104 | 249.0 | 2.0 | 498.0 |
| 4 | 2015-01-04 | 28 | 22091 | 179.0 | 1.0 | 179.0 |
| ... | ... | ... | ... | ... | ... | ... |
| 4540 | 2015-01-04 | 15 | 4240 | 1299.0 | 1.0 | 1299.0 |
| 4541 | 2015-01-04 | 14 | 21922 | 99.0 | 1.0 | 99.0 |
| 4542 | 2015-01-04 | 15 | 1969 | 3999.0 | 1.0 | 3999.0 |
| 4543 | 2015-01-04 | 14 | 22091 | 179.0 | 1.0 | 179.0 |


### Calculating KPIs over revenue

#### Aggregating revenue per store and per item (mean, min, and max)

In [34]:
raw_sales_data.groupby(['shop_id']).agg({'total_revenue':['mean', 'min', 'max']})

| shop_id | total_revenue (mean) | total_revenue (min) | total_revenue (max) |
|---|---|---|---|
| 2 | 1383.28 | 28.0 | 8999.0 |
| 3 | 2043.727273 | 500.0 | 8999.0 |
| 4 | 752.846154 | 79.0 | 2799.0 |
| 5 | 736.4 | 99.0 | 3690.0 |
| 6 | 1100.619048 | 5.0 | 7796.0 |
| 7 | 831.285714 | 99.0 | 3999.0 |
| 10 | 841.333333 | 6.0 | 2456.0 |
| 12 | 2049.8125 | 79.0 | 17998.0 |
| 14 | 1276.666667 | 58.0 | 11997.0 |
| 15 | 1345.580645 | 49.0 | 19990.0 |


In [35]:
raw_sales_data.groupby(['item_id']).agg({'total_revenue':['mean', 'min', 'max']})

| item_id | total_revenue (mean) | total_revenue (min) | total_revenue (max) |
|---|---|---|---|
| 30 | 169.0 | 169.0 | 169.0 |
| 31 | 363.0 | 363.0 | 363.0 |
| 32 | 149.0 | 149.0 | 149.0 |
| 42 | 299.0 | 299.0 | 299.0 |
| 59 | 249.0 | 249.0 | 249.0 |
| ... | ... | ... | ... |
| 22091 | 179.0 | 179.0 | 179.0 |
| 22092 | 179.0 | 179.0 | 179.0 |
| 22104 | 498.0 | 498.0 | 498.0 |
| 22140 | 217.5 | 217.5 | 217.5 |
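
The task asks for aggregates that add up the values, while the cells above report the mean, minimum, and maximum. A sum-based version could look like the sketch below, which uses pandas named aggregation. The column names come from the notebook; the helper name `sum_aggregate`, the output column `items_sold`, and the sample data are assumptions for illustration.

```python
import pandas as pd


def sum_aggregate(df, key):
    """Sum quantity and revenue for each value of `key` ('shop_id' or 'item_id')."""
    return (
        df.groupby(key)
        .agg(
            items_sold=("item_cnt_day", "sum"),
            total_revenue=("total_revenue", "sum"),
        )
        .reset_index()
    )


# Made-up sample that reuses the notebook's column names.
sample = pd.DataFrame({
    "shop_id": [28, 28, 29],
    "item_id": [21364, 21365, 21364],
    "item_price": [479.0, 999.0, 479.0],
    "item_cnt_day": [1.0, 2.0, 1.0],
})
sample["total_revenue"] = sample["item_price"] * sample["item_cnt_day"]

per_store = sum_aggregate(sample, "shop_id")
per_item = sum_aggregate(sample, "item_id")
print(per_store)
print(per_item)
```

Only quantity and revenue are summed here, because adding up `item_id` or `item_price` values does not produce a meaningful number.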


In [15]:

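# Aggregate numeric columns only; pandas 2.x raises instead of silently dropping a non-numeric `date` column
shop_sales = raw_sales_data.groupby('shop_id')[['item_id', 'item_price', 'item_cnt_day', 'total_revenue']].agg(['mean'])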
shop_sales

| shop_id | item_id (mean) | item_price (mean) | item_cnt_day (mean) | total_revenue (mean) |
|---|---|---|---|---|
| 2 | 12891.72 | 1320.94 | 1.08 | 1383.28 |
| 3 | 10174.090909 | 2043.727273 | 1.0 | 2043.727273 |
| 4 | 12785.230769 | 752.846154 | 1.0 | 752.846154 |
| 5 | 13797.066667 | 736.4 | 1.0 | 736.4 |
| 6 | 10054.714286 | 923.428571 | 1.190476 | 1100.619048 |
| 7 | 10619.761905 | 831.285714 | 1.0 | 831.285714 |
| 10 | 11486.555556 | 841.0 | 1.111111 | 841.333333 |
| 12 | 11439.854167 | 1473.586111 | 1.5 | 2049.8125 |
| 14 | 9377.266667 | 743.466667 | 1.133333 | 1276.666667 |
| 15 | 13011.032258 | 1345.580645 | 1.0 | 1345.580645 |


In [16]:
# Per-store aggregate: mean and row count of each numeric column

shop_sales2 = raw_sales_data.groupby('shop_id')[['item_id', 'item_price', 'item_cnt_day', 'total_revenue']].agg(['mean', 'count'])


In [36]:
# One aggregate per item that adds up the rest of the values.

# item_sales = raw_sales_data.groupby(['item_id']).agg(['count', 'min', 'max'])

# item_sales
# item_sales.agg(['item_price', 'item_cnt_day'])

In [37]:
# item_sales2 = raw_sales_data.groupby(['item_cnt_day']).agg(['mean', 'min', 'max'])
# item_sales2