# Mandatory Challenge
## Context
You work in the data analysis team of a very important company. On Monday, the company shares some good news with you: it has just been hired by a major retail company! So, get ready for a huge amount of work!

Then you get to work with your team and define the following tasks to perform:   
1. You need to start your analysis using data from the past.  
2. You need to define a process that takes your daily data as an input and integrates it.  

You are in charge of the second part, so you are provided with a sample file that you will have to read daily. To complete your task, you need the following aggregates:
* One aggregate per store that adds up the rest of the values.
* One aggregate per item that adds up the rest of the values.

You can import the dataset `retail_sales` from Ironhack's database. 

## Your task
Therefore, your process will consist of the following steps:
1. Read the sample file that a daily process will save in your folder. 
2. Clean up the data.
3. Create the aggregates.
4. Write three tables in your local database: 
    - A table for the cleaned data.
    - A table for the aggregate per store.
    - A table for the aggregate per item.

## Instructions
* Read the CSV data you can find in Ironhack's database.
* Clean the data and create the aggregates as you see fit.
* Create the tables in your local database.
* Populate them with your process.

In [13]:
import pandas as pd
from sqlalchemy import create_engine
import pymysql

In [14]:
#driver = 'mysql+pymysql'
#ip = '34.65.10.136'
#username = 'data-students'
#password = 'iR0nH@cK-D4T4B4S3'
#db = 'retail_sales'
#connection_string  = f'{driver}://{username}:{password}@{ip}/{db}'

In [15]:
driver = 'mysql+pymysql'
ip = "127.0.0.1"
username = 'root'
password = 'root'
db = 'retail_sales'
connection_string  = f'{driver}://{username}:{password}@{ip}/{db}'
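One caveat when the connection string is assembled by hand like this: a password containing characters such as `@` or `/` has to be percent-encoded, otherwise the URL is split in the wrong place. A minimal sketch, assuming a made-up password (`p@ss/word` is hypothetical and used only for illustration):

```python
from urllib.parse import quote_plus

driver = "mysql+pymysql"
ip = "127.0.0.1"
username = "root"
password = "p@ss/word"  # hypothetical password, chosen because it contains '@' and '/'
db = "retail_sales"

# Percent-encode the password so '@' and '/' are not read as URL delimiters
safe_password = quote_plus(password)
connection_string = f"{driver}://{username}:{safe_password}@{ip}/{db}"
print(connection_string)
```

Passwords made only of letters and digits, such as `root`, come out of `quote_plus` unchanged, so the encoding step is safe to apply in every case.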

In [16]:
# Engine & Query
engine = create_engine(connection_string)
query = 'SELECT * FROM raw_sales'

# Database Request
raw_sales = pd.read_sql(query,engine)
raw_sales.head(10)



,date,shop_id,item_id,item_price,item_cnt_day
0,2015-01-04,29,1469,1199.0,1.0
1,2015-01-04,28,21364,479.0,1.0
2,2015-01-04,28,21365,999.0,2.0
3,2015-01-04,28,22104,249.0,2.0
4,2015-01-04,28,22091,179.0,1.0
5,2015-01-04,28,21842,149.0,1.0
6,2015-01-04,28,21881,299.0,1.0
7,2015-01-04,29,6930,2199.0,1.0
8,2015-01-04,29,10515,169.0,1.0
9,2015-01-04,29,8624,149.0,1.0


## Data Cleaning

1) I created a copy of the dataset to keep the original unmodified

In [17]:
raw_sales2=raw_sales.copy()

2) I checked for missing values; this dataset has none

In [18]:
raw_sales2.isnull().sum() # sum() counts the missing values in each column

date            0
shop_id         0
item_id         0
item_price      0
item_cnt_day    0
dtype: int64
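Because the process is meant to run on a new file every day, a clean result today does not guarantee a clean file tomorrow. The same check can be wrapped in a small guard that stops the run when gaps appear. This is only a sketch: the function name `check_no_missing` is an assumption, and the two sample rows are copied from the `raw_sales` preview.

```python
import pandas as pd

def check_no_missing(df: pd.DataFrame) -> pd.Series:
    """Return the missing-value count per column; raise if any column has gaps."""
    missing = df.isnull().sum()
    if missing.any():
        # Report only the columns that actually contain missing values
        raise ValueError(f"Missing values found:\n{missing[missing > 0]}")
    return missing

# Two rows copied from the raw_sales preview
sample = pd.DataFrame({
    "date": ["2015-01-04", "2015-01-04"],
    "shop_id": [29, 28],
    "item_id": [1469, 21364],
    "item_price": [1199.0, 479.0],
    "item_cnt_day": [1.0, 1.0],
})

counts = check_no_missing(sample)
print(counts)
```

Whether a daily run should fail, drop the affected rows, or fill them in is a design decision the challenge leaves open; raising an error is simply the most conservative option.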

3) I dropped the `date` column, which is not needed for the aggregates

In [19]:

raw_sales2.drop("date", inplace=True, axis=1)


In [20]:
raw_sales2.head(10)

,shop_id,item_id,item_price,item_cnt_day
0,29,1469,1199.0,1.0
1,28,21364,479.0,1.0
2,28,21365,999.0,2.0
3,28,22104,249.0,2.0
4,28,22091,179.0,1.0
5,28,21842,149.0,1.0
6,28,21881,299.0,1.0
7,29,6930,2199.0,1.0
8,29,10515,169.0,1.0
9,29,8624,149.0,1.0


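4) I grouped the data by shop_id in order to get the aggregate per store

In [22]:
# Sum every remaining column per shop (this also sums item_id, which has no meaning as a total)
raw_sales2=raw_sales2.groupby("shop_id").sum()

In [23]:
# raw_sales2 is already grouped by shop_id, so a second groupby is not needed
raw_sales2.head(10)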

shop_id,item_id,item_price,item_cnt_day
2,966879,99070.5,81.0
3,335745,67443.0,33.0
4,498624,29361.0,39.0
5,620868,33138.0,45.0
6,1266894,116352.0,150.0
7,669045,52371.0,63.0
10,310137,22707.0,30.0
12,1647339,212196.4,216.0
14,421977,33456.0,51.0
15,1210026,125139.0,93.0
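Summing every column also adds up `item_id`, which is an identifier, so that total carries no information. A possible alternative, not the approach used in this notebook, is pandas named aggregation, which picks one function per column. The output column names `units_sold` and `distinct_items` are assumptions made for this sketch; the rows are the ten cleaned rows from the preview in step 3.

```python
import pandas as pd

# The ten cleaned rows shown in the preview (date already dropped)
cleaned = pd.DataFrame({
    "shop_id": [29, 28, 28, 28, 28, 28, 28, 29, 29, 29],
    "item_id": [1469, 21364, 21365, 22104, 22091, 21842, 21881, 6930, 10515, 8624],
    "item_price": [1199.0, 479.0, 999.0, 249.0, 179.0, 149.0, 299.0, 2199.0, 169.0, 149.0],
    "item_cnt_day": [1.0, 1.0, 2.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
})

# One aggregation per column: sum the units, count the distinct items
per_shop = cleaned.groupby("shop_id").agg(
    units_sold=("item_cnt_day", "sum"),
    distinct_items=("item_id", "nunique"),
)
print(per_shop)
```

On these ten rows, shop 28 ends up with 8 units across 6 distinct items and shop 29 with 4 units across 4 items.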


5) I analyzed the descriptive statistics of the per-store aggregates

In [26]:
# raw_sales2 already holds one row per shop, so describe() can be applied directly
raw_sales2.describe()

,item_id,item_price,item_cnt_day
count,45.0,45.0,45.0
mean,1125186.0,104200.298222,111.466667
std,918021.0,76781.42107,87.046905
min,200076.0,9285.0,18.0
25%,602433.0,47109.0,60.0
50%,798573.0,85737.0,78.0
75%,1266894.0,135318.0,150.0
max,3976677.0,327864.0,402.0


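To get the revenue by store, I first calculated the revenue of each row (item price times units sold) on the cleaned, ungrouped data, so that it can then be summed per store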

In [11]:
raw_sales2["revenue"]=raw_sales2["item_price"]*raw_sales2["item_cnt_day"]

In [12]:
raw_sales2.head(10)

,shop_id,item_id,item_price,item_cnt_day,revenue
0,29,1469,1199.0,1.0,1199.0
1,28,21364,479.0,1.0,479.0
2,28,21365,999.0,2.0,1998.0
3,28,22104,249.0,2.0,498.0
4,28,22091,179.0,1.0,179.0
5,28,21842,149.0,1.0,149.0
6,28,21881,299.0,1.0,299.0
7,29,6930,2199.0,1.0,2199.0
8,29,10515,169.0,1.0,169.0
9,29,8624,149.0,1.0,149.0
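
The remaining steps of the task (the aggregate per item and the three tables) could be sketched as follows. This is an illustration under stated assumptions, not the notebook's own solution: an in-memory SQLite database stands in for the local MySQL database so the example runs anywhere, the table names `clean_sales`, `sales_per_store` and `sales_per_item` are hypothetical, and the rows are the ten rows of the revenue preview.

```python
import sqlite3
import pandas as pd

# The ten rows shown in the revenue preview
cleaned = pd.DataFrame({
    "shop_id": [29, 28, 28, 28, 28, 28, 28, 29, 29, 29],
    "item_id": [1469, 21364, 21365, 22104, 22091, 21842, 21881, 6930, 10515, 8624],
    "item_price": [1199.0, 479.0, 999.0, 249.0, 179.0, 149.0, 299.0, 2199.0, 169.0, 149.0],
    "item_cnt_day": [1.0, 1.0, 2.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
})
cleaned["revenue"] = cleaned["item_price"] * cleaned["item_cnt_day"]

# Sum only the columns that are meaningful as totals
value_cols = ["item_cnt_day", "revenue"]
per_store = cleaned.groupby("shop_id", as_index=False)[value_cols].sum()
per_item = cleaned.groupby("item_id", as_index=False)[value_cols].sum()

# Assumption: in-memory SQLite replaces the local MySQL database for this sketch
conn = sqlite3.connect(":memory:")
tables = {
    "clean_sales": cleaned,        # hypothetical table names
    "sales_per_store": per_store,
    "sales_per_item": per_item,
}
for name, frame in tables.items():
    frame.to_sql(name, conn, if_exists="replace", index=False)

# Read one table back to confirm the write
check = pd.read_sql(
    "SELECT shop_id, revenue FROM sales_per_store ORDER BY shop_id", conn
)
print(check)
```

With the MySQL setup used earlier in the notebook, the `engine` object would take the place of `conn`. For a process that runs daily, `if_exists="append"` may be the better fit for the cleaned table, since `"replace"` overwrites the previous day's rows.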
