# Mandatory Challenge
## Context
You work in the data analysis team of a very important company. On Monday, the company shares some good news with you: it has just been hired by a major retail company! So, let's get prepared for a huge amount of work!

Then you get to work with your team and define the following tasks to perform:   
1. You need to start your analysis using data from the past.  
2. You need to define a process that takes your daily data as an input and integrates it.  

You are in charge of the second part, so you are provided with a sample file that you will have to read daily. To complete your task, you need the following aggregates:
* One aggregate per store that adds up the rest of the values.
* One aggregate per item that adds up the rest of the values.

You can import the dataset `retail_sales` from Ironhack's database. 

## Your task
Therefore, your process will consist of the following steps:
1. Read the sample file that a daily process will save in your folder. 
2. Clean up the data.
3. Create the aggregates.
4. Write three tables in your local database: 
    - A table for the cleaned data.
    - A table for the aggregate per store.
    - A table for the aggregate per item.

## Instructions
* Read the csv you can find in Ironhack's database.
* Clean the data and create the aggregates as you consider.
* Create the tables in your local database.
* Populate them with your process.

In [2]:
# your code here
import pandas as pd
import numpy as np

### Fill in the user, password and ip fields when using this notebook

In [1]:
driver = 'mysql+pymysql'
user = 'USERNAME'
password = 'PASSWORD'
ip = 'IP'
database = 'retail_sales'

In [2]:
connection_string = f'{driver}://{user}:{password}@{ip}/{database}'
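If the password contains characters such as `@`, `:` or `/`, a plain f-string produces a URL that cannot be parsed correctly. One way to guard against that is to percent-encode the credentials with the standard library before interpolating them. The sketch below is a minimal illustration; the helper `build_connection_string` and the credentials are hypothetical and not part of the original notebook.

```python
from urllib.parse import quote_plus


def build_connection_string(user, password, host, database, driver="mysql+pymysql"):
    """Hypothetical helper: build a database URL with escaped credentials."""
    # quote_plus percent-encodes characters that have a special meaning in a URL
    safe_user = quote_plus(user)
    safe_password = quote_plus(password)
    return f"{driver}://{safe_user}:{safe_password}@{host}/{database}"


# Made-up credentials, for illustration only
example_url = build_connection_string("analyst", "p@ss:word/1", "127.0.0.1", "retail_sales")
print(example_url)
```

The resulting string can be passed to `create_engine` in the same way as the hand-built one.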

This will connect to the database and then return the table names.

In [3]:
from sqlalchemy import create_engine, inspect

engine = create_engine(connection_string)
# engine.table_names() was removed in SQLAlchemy 2.0; the inspector also works on 1.4
inspect(engine).get_table_names()

The query will get all the data from the raw_sales table that we will use to create the aggregates.

In [19]:
query = """
SELECT * FROM raw_sales;
"""

In [20]:
raw_sales = pd.read_sql(query, engine)

Now that the data is loaded let's take a general look at it to understand if it needs cleaning.

In [24]:
raw_sales.head()

| | date | shop_id | item_id | item_price | item_cnt_day |
|---|---|---|---|---|---|
| 0 | 2015-01-04 | 29 | 1469 | 1199.0 | 1.0 |
| 1 | 2015-01-04 | 28 | 21364 | 479.0 | 1.0 |
| 2 | 2015-01-04 | 28 | 21365 | 999.0 | 2.0 |
| 3 | 2015-01-04 | 28 | 22104 | 249.0 | 2.0 |
| 4 | 2015-01-04 | 28 | 22091 | 179.0 | 1.0 |


In [26]:
raw_sales.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4545 entries, 0 to 4544
Data columns (total 5 columns):
date            4545 non-null object
shop_id         4545 non-null int64
item_id         4545 non-null int64
item_price      4545 non-null float64
item_cnt_day    4545 non-null float64
dtypes: float64(2), int64(2), object(1)
memory usage: 213.0+ KB


To start, we can convert the `date` column to datetime. `item_cnt_day` can also be converted to int, since the number of items sold is always a whole number (you can't sell half an item).

In [28]:
raw_sales.date = pd.to_datetime(raw_sales.date)

In [32]:
raw_sales.item_cnt_day = raw_sales.item_cnt_day.astype('int')
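Because `astype('int')` truncates silently (1.5 would become 1), it can be worth confirming that every count really is a whole number before casting. The sketch below shows one way to do that on a made-up two-row frame that only mimics the shape of `raw_sales`; the variable names are assumptions for illustration.

```python
import pandas as pd

# Made-up rows that mimic the shape of raw_sales
sample = pd.DataFrame({
    "date": ["2015-01-04", "2015-01-05"],
    "item_cnt_day": [1.0, 2.0],
})

sample["date"] = pd.to_datetime(sample["date"])

# A float is a whole number when its remainder modulo 1 is zero
counts_are_whole = bool((sample["item_cnt_day"] % 1 == 0).all())
if counts_are_whole:
    sample["item_cnt_day"] = sample["item_cnt_day"].astype("int64")

print(counts_are_whole)
print(sample.dtypes)
```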

Let's check if they're correct now.

In [33]:
raw_sales.dtypes

date            datetime64[ns]
shop_id                  int64
item_id                  int64
item_price             float64
item_cnt_day             int64
dtype: object

Perfect. Now we can use `describe()` to look at some descriptive statistics of our data.

In [34]:
raw_sales.describe()

| | shop_id | item_id | item_price | item_cnt_day |
|---|---|---|---|---|
| count | 4545.0 | 4545.0 | 4545.0 | 4545.0 |
| mean | 34.021122 | 11140.459406 | 1031.686121 | 1.10363 |
| std | 16.565517 | 6558.649572 | 2073.91999 | 0.536967 |
| min | 2.0 | 30.0 | 3.0 | -1.0 |
| 25% | 22.0 | 4977.0 | 249.0 | 1.0 |
| 50% | 31.0 | 11247.0 | 479.0 | 1.0 |
| 75% | 50.0 | 16671.0 | 1192.0 | 1.0 |
| max | 59.0 | 22162.0 | 27990.0 | 10.0 |


The maximum price (27990) is far higher than the mean (about 1032), and the quartiles clearly show that the middle 50% of prices lie in the 249-1192 range. We can therefore assume that there may be some outliers and remove them.

One option is to use the interquartile range (IQR) to set a lower and an upper limit. We multiply the IQR by 1.5, then subtract that value from Q1 (25% in the describe table) to get the lower bound and add it to Q3 (75% in the describe table) to get the upper bound. Any value outside these limits can be considered an outlier.
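Plugging the `item_price` quartiles from the describe table into this rule gives concrete bounds. The short calculation below is only a worked example of the arithmetic, with the two quartiles typed in by hand:

```python
# Quartiles of item_price as reported by describe()
first_q = 249.0
third_q = 1192.0

iqr = third_q - first_q          # 943.0
margin = 1.5 * iqr               # 1414.5
lower_limit = first_q - margin   # -1165.5
upper_limit = third_q + margin   # 2606.5

print(lower_limit, upper_limit)
```

Since the lower limit is negative and the minimum price in the describe table is 3.0, only the upper limit can remove rows from this sample.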

In [43]:
first_q, third_q = raw_sales.item_price.quantile([0.25, 0.75])

In [44]:
iqr = third_q - first_q

In [54]:
lower_limit = first_q - (iqr * 1.5)
upper_limit = third_q + (iqr * 1.5)

Now that we have the two limits we will only keep the rows of the DataFrame where the item_price is higher than the lower_limit and lower than the upper_limit.

In [58]:
raw_sales_clean = raw_sales.loc[(raw_sales.item_price > lower_limit) & (raw_sales.item_price < upper_limit)]
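For a process that runs daily, the same filter could be wrapped in a small reusable function. This is a minimal sketch under stated assumptions: the helper name `drop_iqr_outliers`, its parameter `k`, and the toy prices are all hypothetical and do not come from the original notebook.

```python
import pandas as pd


def drop_iqr_outliers(df, column, k=1.5):
    """Hypothetical helper: keep rows whose `column` lies strictly inside the IQR limits."""
    first_q, third_q = df[column].quantile([0.25, 0.75])
    iqr = third_q - first_q
    lower_limit = first_q - k * iqr
    upper_limit = third_q + k * iqr
    mask = (df[column] > lower_limit) & (df[column] < upper_limit)
    # Return a copy so later modifications do not touch the original frame
    return df.loc[mask].copy()


# Made-up prices with one obvious outlier
toy = pd.DataFrame({"item_price": [100.0, 200.0, 300.0, 400.0, 10000.0]})
cleaned = drop_iqr_outliers(toy, "item_price")
print(len(toy) - len(cleaned), "row(s) removed")
```

Comparing the row counts before and after filtering is a quick sanity check that the limits are not discarding more data than expected.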

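We can now create the per-shop and per-item aggregates. Note that the sum below adds up the `item_price` values of the rows in each group without multiplying them by `item_cnt_day`.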

In [76]:
shop_agg = raw_sales_clean.loc[:, ['shop_id', 'item_price', 'item_cnt_day']].groupby('shop_id').sum()

In [78]:
shop_agg.columns = ['total_sales', 'total_num_items_sold']

In [79]:
shop_agg.head()

| shop_id | total_sales | total_num_items_sold |
|---|---|---|
| 2 | 39469.5 | 69 |
| 3 | 21279.0 | 24 |
| 4 | 20964.0 | 36 |
| 5 | 22068.0 | 42 |
| 6 | 64794.0 | 135 |


In [80]:
item_agg = raw_sales_clean.loc[:, ['item_id', 'item_price', 'item_cnt_day']].groupby('item_id').sum()

In [82]:
item_agg.columns = ['item_earning', 'num_items_sold']

In [83]:
item_agg.head()

| item_id | item_earning | num_items_sold |
|---|---|---|
| 30 | 507.0 | 3 |
| 31 | 1089.0 | 3 |
| 32 | 447.0 | 3 |
| 42 | 897.0 | 3 |
| 59 | 747.0 | 3 |
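If `item_price` is a unit price (an assumption, since the notebook does not say so), a revenue figure would need each row's price multiplied by its count before summing. The sketch below shows that variant on made-up rows, using named aggregation so the column names are set in the same step; the helper `aggregate_by` and the column `revenue` are hypothetical.

```python
import pandas as pd

# Made-up rows with the same columns as raw_sales_clean
toy = pd.DataFrame({
    "shop_id": [28, 28, 29],
    "item_id": [21364, 21365, 21364],
    "item_price": [479.0, 999.0, 479.0],
    "item_cnt_day": [1, 2, 3],
})

# Revenue of a row = unit price * units sold (assumes item_price is a unit price)
toy = toy.assign(revenue=toy["item_price"] * toy["item_cnt_day"])


def aggregate_by(df, key):
    """Hypothetical helper: total revenue and units sold per value of `key`."""
    return df.groupby(key).agg(
        total_revenue=("revenue", "sum"),
        total_num_items_sold=("item_cnt_day", "sum"),
    )


shop_revenue = aggregate_by(toy, "shop_id")
item_revenue = aggregate_by(toy, "item_id")
print(shop_revenue)
print(item_revenue)
```

Which definition is appropriate depends on what the downstream analysis expects, so this is an alternative to consider rather than a correction of the tables above.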


Now that we have our aggregates and our clean table we can load them into our local database.

Remember to change the credentials so that they match your local database.

In [85]:
driver = 'mysql+pymysql'
user = 'USER'
password = 'PASSWORD'
ip = 'IP'
database = 'NAME'

In [86]:
connection_string = f'{driver}://{user}:{password}@{ip}/{database}'

In [None]:
engine = create_engine(connection_string)

In [None]:
raw_sales_clean.to_sql('raw_sales_clean', engine)
# The aggregates were created above as shop_agg and item_agg
shop_agg.to_sql('sales_by_shop_agg', engine)
item_agg.to_sql('sales_by_item_agg', engine)
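By default `to_sql` uses `if_exists='fail'`, so a second run of the daily process would raise an error once the tables exist. The sketch below illustrates one possible way around this with `if_exists='replace'`. It is a demo under stated assumptions: an in-memory SQLite database stands in for the local MySQL database, and the aggregate values are made up.

```python
import sqlite3

import pandas as pd

# Stand-in for the per-shop aggregate; values are made up
shop_agg = pd.DataFrame(
    {"total_sales": [100.0, 250.0], "total_num_items_sold": [2, 5]},
    index=pd.Index([2, 3], name="shop_id"),
)

# Assumption: an in-memory SQLite database replaces the MySQL engine for this demo
connection = sqlite3.connect(":memory:")

# Simulate two daily runs; if_exists='replace' drops and recreates the table each time
for _ in range(2):
    shop_agg.to_sql("sales_by_shop_agg", connection, if_exists="replace")

stored = pd.read_sql("SELECT * FROM sales_by_shop_agg", connection)
connection.close()
print(stored)
```

If the tables should accumulate history instead of being overwritten, `if_exists='append'` is the other documented option, although the aggregates would then need to be recomputed or merged separately.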