# Mandatory Challenge
## Context
You work in the data analysis team of a very important company. On Monday, the company shares some good news with you: it has just been hired by a major retail company! So, get ready for a huge amount of work!

Then you get to work with your team and define the following tasks to perform:   
1. You need to start your analysis using data from the past.  
2. You need to define a process that takes your daily data as an input and integrates it.  

You are in charge of the second part, so you are provided with a sample file that you will have to read daily. To complete your task, you need the following aggregates:
* One aggregate per store that adds up the rest of the values.
* One aggregate per item that adds up the rest of the values.
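
One possible reading of "adds up the rest of the values" is to sum every numeric column that is not an identifier. The sketch below illustrates that reading on made-up rows; the helper `aggregate_by` and the `ID_COLUMNS` list are hypothetical names, not part of the assignment.

```python
import pandas as pd

# Assumption: these columns are identifiers, so they are grouped on but never summed
ID_COLUMNS = ["shop_id", "item_id"]


def aggregate_by(df: pd.DataFrame, key: str) -> pd.DataFrame:
    """Sum every numeric, non-identifier column for each value of `key`."""
    value_columns = [
        col for col in df.select_dtypes("number").columns if col not in ID_COLUMNS
    ]
    return df.groupby(key, as_index=False)[value_columns].sum()


# Made-up rows that mimic the layout of the sample data
sample = pd.DataFrame({
    "shop_id": [28, 28, 29],
    "item_id": [30, 31, 30],
    "item_price": [169.0, 363.0, 169.0],
    "item_cnt_day": [1.0, 2.0, 1.0],
})

per_store = aggregate_by(sample, "shop_id")
per_item = aggregate_by(sample, "item_id")
print(per_store)
print(per_item)
```

Summing `item_price` directly is rarely meaningful on its own, so a revenue column (price times quantity) is probably a more useful value to add up.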

You can import the dataset `retail_sales` from Ironhack's database. 

## Your task
Therefore, your process will consist of the following steps:
1. Read the sample file that a daily process will save in your folder. 
2. Clean up the data.
3. Create the aggregates.
4. Write three tables in your local database: 
    - A table for the cleaned data.
    - A table for the aggregate per store.
    - A table for the aggregate per item.

## Instructions
* Read the sample data you can find in Ironhack's database.
* Clean the data and create the aggregates as you consider.
* Create the tables in your local database.
* Populate them with your process.

## Importing the data

In [1]:
# your code here
import pandas as pd
from sqlalchemy import create_engine    


In [2]:
#Connection to Ironhack Database
from urllib.parse import quote_plus

driver   = 'mysql+pymysql'
user     = 'data-students'
password = 'iR0nH@cK-D4T4B4S3'
ip       = '34.65.10.136'
database = 'retail_sales'
#URL-encode the password: its '@' could otherwise be read as the host separator
connection_string = f'{driver}://{user}:{quote_plus(password)}@{ip}/{database}'
engine = create_engine(connection_string)



In [3]:
#Show tables in DB
pd.read_sql('SHOW TABLES;', engine)

Unnamed: 0,Tables_in_retail_sales
0,raw_sales


In [5]:
#Create dataframe with raw_sales
def load_raw_sales():
    """Read the whole raw_sales table into a DataFrame."""
    return pd.read_sql('SELECT * FROM raw_sales;', engine)

load_raw_sales().head()

Unnamed: 0,date,shop_id,item_id,item_price,item_cnt_day
0,2015-01-04,29,1469,1199.0,1.0
1,2015-01-04,28,21364,479.0,1.0
2,2015-01-04,28,21365,999.0,2.0
3,2015-01-04,28,22104,249.0,2.0
4,2015-01-04,28,22091,179.0,1.0


## General analysis of the data

In [6]:
#The function has its own name so that re-running this cell does not overwrite it
raw_sales = load_raw_sales()

In [7]:
#First 5 rows for a quick glance
raw_sales.head(5)

Unnamed: 0,date,shop_id,item_id,item_price,item_cnt_day
0,2015-01-04,29,1469,1199.0,1.0
1,2015-01-04,28,21364,479.0,1.0
2,2015-01-04,28,21365,999.0,2.0
3,2015-01-04,28,22104,249.0,2.0
4,2015-01-04,28,22091,179.0,1.0


In [8]:
#All column data types
raw_sales.dtypes

date            datetime64[ns]
shop_id                  int64
item_id                  int64
item_price             float64
item_cnt_day           float64
dtype: object

In [9]:
#Quick overview of the data
raw_sales.describe()

Unnamed: 0,shop_id,item_id,item_price,item_cnt_day
count,4545.0,4545.0,4545.0,4545.0
mean,34.021122,11140.459406,1031.686121,1.10363
std,16.565517,6558.649572,2073.91999,0.536967
min,2.0,30.0,3.0,-1.0
25%,22.0,4977.0,249.0,1.0
50%,31.0,11247.0,479.0,1.0
75%,50.0,16671.0,1192.0,1.0
max,59.0,22162.0,27990.0,10.0


In [10]:
#Count null/NaN values per column
raw_sales.isna().sum()

date            0
shop_id         0
item_id         0
item_price      0
item_cnt_day    0
dtype: int64

In [11]:
#Confirm with a single boolean that there are no null/NaN values at all
raw_sales.isna().values.any()

False

In [12]:
#Total number of unique shops
len(raw_sales.shop_id.unique())

45

In [13]:
#Total number of unique items
len(raw_sales.item_id.unique())

985

In [14]:
#Total number of unique prices
len(raw_sales.item_price.unique())

223

In [15]:
#Unique values of item_cnt_day. Negative values are returns
raw_sales.item_cnt_day.unique()

array([ 1.,  2.,  6.,  3., -1.,  4.,  5., 10.])

In [16]:
#Frequency of each item_cnt_day value
raw_sales['item_cnt_day'].value_counts()

 1.0     4161
 2.0      264
 3.0       42
 4.0       30
-1.0       30
 5.0        9
 6.0        6
 10.0       3
Name: item_cnt_day, dtype: int64

In [17]:
#Identifying returned items
returned_items = raw_sales.item_cnt_day < 0
returned_items.value_counts()

False    4515
True       30
Name: item_cnt_day, dtype: int64
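
Because negative `item_cnt_day` values are treated as returns, one option is to keep sales and returns in separate columns before aggregating, so that returns do not silently cancel out sales. This is a minimal sketch on made-up rows; the column names `units_sold` and `units_returned` are illustrative and do not exist in the dataset.

```python
import pandas as pd

# Made-up rows that mimic the raw_sales layout
sample = pd.DataFrame({
    "shop_id": [28, 28, 29, 29],
    "item_id": [30, 30, 31, 31],
    "item_cnt_day": [1.0, -1.0, 2.0, 1.0],
})

# Positive counts are sales; negative counts are returns, stored as positive units
sample["units_sold"] = sample["item_cnt_day"].clip(lower=0)
sample["units_returned"] = (-sample["item_cnt_day"]).clip(lower=0)

per_shop = sample.groupby("shop_id")[["units_sold", "units_returned"]].sum()
print(per_shop)
```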

In [18]:
#Identifying negatively priced items
negative_item_price = raw_sales.item_price < 0
negative_item_price.value_counts()

False    4545
Name: item_price, dtype: int64

In [19]:
#Checking bins overview to identify possible outliers
raw_sales.item_price.value_counts(bins=5)

(-24.988, 5600.4]     4500
(22392.6, 27990.0]      18
(5600.4, 11197.8]       18
(11197.8, 16795.2]       6
(16795.2, 22392.6]       3
Name: item_price, dtype: int64

In [20]:
#Add price range column
price_range_label = ['Very cheap', 'Cheap', 'Average', 'Expensive', 'Very Expensive']
#pd.cut splits the price range into equal-width bins, one per label
raw_sales['price_range'] = pd.cut(raw_sales['item_price'], bins=len(price_range_label), labels=price_range_label)
raw_sales

Unnamed: 0,date,shop_id,item_id,item_price,item_cnt_day,price_range
0,2015-01-04,29,1469,1199.0,1.0,Very cheap
1,2015-01-04,28,21364,479.0,1.0,Very cheap
2,2015-01-04,28,21365,999.0,2.0,Very cheap
3,2015-01-04,28,22104,249.0,2.0,Very cheap
4,2015-01-04,28,22091,179.0,1.0,Very cheap
...,...,...,...,...,...,...
4540,2015-01-04,15,4240,1299.0,1.0,Very cheap
4541,2015-01-04,14,21922,99.0,1.0,Very cheap
4542,2015-01-04,15,1969,3999.0,1.0,Very cheap
4543,2015-01-04,14,22091,179.0,1.0,Very cheap
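
With equal-width bins, the few very expensive items stretch the range, so almost every row ends up labelled "Very cheap". If a more even split is wanted, quantile-based bins via `pd.qcut` are one alternative. The sketch below compares both on a small made-up price list; it is an illustration, not a recommendation for this dataset.

```python
import pandas as pd

labels = ["Very cheap", "Cheap", "Average", "Expensive", "Very Expensive"]

# Made-up prices with a long right tail, similar in shape to item_price
prices = pd.Series(
    [3.0, 99.0, 179.0, 249.0, 479.0, 999.0, 1199.0, 3999.0, 11999.0, 27990.0]
)

# Equal-width bins: each bin covers the same price span
equal_width = pd.cut(prices, bins=len(labels), labels=labels)

# Quantile bins: each bin holds roughly the same number of rows
quantile_based = pd.qcut(prices, q=len(labels), labels=labels)

print(equal_width.value_counts())
print(quantile_based.value_counts())
```

Note that `pd.qcut` raises an error when many rows share the same price and the bin edges collide, so real data may need fewer bins or explicit edges.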


# Grouping the data

## Grouping by item

In [22]:
#All items sold by price, date, count per day, and item revenue
items_sold = raw_sales[['shop_id','item_id', 'item_price', 'date', 'item_cnt_day']]
items_sold = items_sold.sort_values("item_id", ascending=True)
items_sold['item_revenue'] = items_sold['item_price'].multiply(items_sold['item_cnt_day'])
items_sold.head()

Unnamed: 0,shop_id,item_id,item_price,date,item_cnt_day,item_revenue
1597,28,30,169.0,2015-01-04,1.0,169.0
3112,28,30,169.0,2015-01-04,1.0,169.0
82,28,30,169.0,2015-01-04,1.0,169.0
1331,6,31,363.0,2015-01-04,1.0,363.0
2846,6,31,363.0,2015-01-04,1.0,363.0
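
The rows above repeat with identical values under different index labels, which may indicate duplicated records (or simply separate identical transactions). A cleaning step could at least measure this before deciding what to do. A minimal sketch on made-up rows:

```python
import pandas as pd

# Made-up rows: the first three are exact copies of each other
sample = pd.DataFrame({
    "date": pd.to_datetime(["2015-01-04"] * 4),
    "shop_id": [28, 28, 28, 6],
    "item_id": [30, 30, 30, 31],
    "item_price": [169.0, 169.0, 169.0, 363.0],
    "item_cnt_day": [1.0, 1.0, 1.0, 1.0],
})

# duplicated() flags every repeat after the first occurrence
n_duplicates = int(sample.duplicated().sum())

# drop_duplicates() keeps only the first occurrence of each distinct row
deduplicated = sample.drop_duplicates().reset_index(drop=True)

print(n_duplicates)
print(deduplicated)
```

Whether dropping is correct depends on the source: without a transaction identifier, identical rows cannot be told apart from genuine repeated sales.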


In [23]:
#Count sales rows per date and item_id
#Counting any non-null column (here shop_id) gives the number of rows in each group
items_sold_by_id = items_sold.groupby(['date','item_id']).agg({'shop_id':'count'})
items_sold_by_id.rename(columns = {'shop_id': 'total_sold'}, inplace=True)
items_sold_by_id.head()

date,item_id,total_sold
2015-01-04,30,3
2015-01-04,31,3
2015-01-04,32,3
2015-01-04,42,3
2015-01-04,59,3


In [24]:
#Group items revenue by item_id
items_revenue = items_sold.groupby(['date','item_id']).agg({'item_revenue':'sum'})
items_revenue

date,item_id,item_revenue
2015-01-04,30,507.0
2015-01-04,31,1089.0
2015-01-04,32,447.0
2015-01-04,42,897.0
2015-01-04,59,747.0
2015-01-04,...,...
2015-01-04,22091,1074.0
2015-01-04,22092,537.0
2015-01-04,22104,1494.0
2015-01-04,22140,652.5


In [25]:
#Combine revenue and number of sales rows per item
items_by_qty_revenue = pd.merge(items_revenue, items_sold_by_id , on='item_id')
items_by_qty_revenue = items_by_qty_revenue.sort_values('item_id')
items_by_qty_revenue

item_id,item_revenue,total_sold
30,507.0,3
31,1089.0,3
32,447.0,3
42,897.0,3
59,747.0,3
...,...,...
22091,1074.0,6
22092,537.0,3
22104,1494.0,3
22140,652.5,3
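
Merging on `item_id` alone works while the file holds a single date, but with several dates it would pair rows across days. One way to avoid the merge entirely is to build all columns in a single `groupby` with named aggregation, keeping `date` as part of the key. This is a sketch on made-up rows; `units_sold` is a hypothetical extra column that sums quantities rather than counting rows.

```python
import pandas as pd

# Made-up rows that mimic items_sold
items_sold = pd.DataFrame({
    "date": pd.to_datetime(["2015-01-04"] * 4),
    "shop_id": [28, 28, 6, 6],
    "item_id": [30, 30, 31, 31],
    "item_price": [169.0, 169.0, 363.0, 363.0],
    "item_cnt_day": [1.0, 2.0, 1.0, 1.0],
})
items_sold["item_revenue"] = items_sold["item_price"] * items_sold["item_cnt_day"]

# One pass produces every aggregate, keyed by both date and item_id
items_summary = (
    items_sold.groupby(["date", "item_id"])
    .agg(
        total_sold=("shop_id", "count"),      # number of sales rows
        units_sold=("item_cnt_day", "sum"),   # number of units (assumed useful)
        item_revenue=("item_revenue", "sum"),
    )
    .reset_index()
)
print(items_summary)
```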


### Export items by quantity and revenue into csv
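
A minimal sketch of such an export, assuming a DataFrame shaped like `items_by_qty_revenue`. It writes to an in-memory buffer so it runs anywhere; passing a file path to `to_csv` instead would write to disk (any file name is up to you).

```python
import io

import pandas as pd

# Made-up rows shaped like items_by_qty_revenue (item_id as the index)
items_by_qty_revenue = pd.DataFrame(
    {"item_revenue": [507.0, 1089.0], "total_sold": [3, 3]},
    index=pd.Index([30, 31], name="item_id"),
)

# The index is written as the first CSV column, under its name
buffer = io.StringIO()
items_by_qty_revenue.to_csv(buffer)
csv_text = buffer.getvalue()
print(csv_text)
```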

## Grouping by shop

In [28]:
#Revenue by date, shop_id and item_id
items_revenue = items_sold.groupby(['date','shop_id','item_id']).agg({'item_revenue':'sum'})
items_revenue.rename(columns = {'item_revenue': 'total_revenue'}, inplace=True)
items_revenue.head()

date,shop_id,item_id,total_revenue
2015-01-04,2,1970,26997.0
2015-01-04,2,1971,13497.0
2015-01-04,2,2871,2997.0
2015-01-04,2,2881,2997.0
2015-01-04,2,3028,7797.0


In [29]:
#Revenue by shop
shop_revenue = items_revenue.groupby(['date','shop_id']).agg({'total_revenue':'sum'})
shop_revenue.head()

date,shop_id,total_revenue
2015-01-04,2,103746.0
2015-01-04,3,67443.0
2015-01-04,4,29361.0
2015-01-04,5,33138.0
2015-01-04,6,138678.0


In [30]:
#Count sales rows per date and shop_id
items_by_shop = raw_sales.groupby(['date','shop_id']).agg({'item_id':'count'})
items_by_shop.rename(columns = {'item_id': 'items_sold'}, inplace=True)
items_by_shop.head()

date,shop_id,items_sold
2015-01-04,2,75
2015-01-04,3,33
2015-01-04,4,39
2015-01-04,5,45
2015-01-04,6,126


In [31]:
#Combine number of sales rows and revenue per shop
shops_by_quantity_revenue = pd.merge(items_by_shop, shop_revenue, on='shop_id')
shops_by_quantity_revenue.head()

Unnamed: 0_level_0,items_sold,total_revenue
shop_id,Unnamed: 1_level_1,Unnamed: 2_level_1
2,75,103746.0
3,33,67443.0
4,39,29361.0
5,45,33138.0
6,126,138678.0
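
The remaining step of the task is to write the tables to a local database. The sketch below shows one way to do that with `DataFrame.to_sql`, using an in-memory SQLite database so it runs without a server; the table name `shop_aggregates` is hypothetical, and a local MySQL setup would pass a SQLAlchemy engine instead of the `sqlite3` connection.

```python
import sqlite3

import pandas as pd

# Made-up rows shaped like shops_by_quantity_revenue, with shop_id as a column
shops_by_quantity_revenue = pd.DataFrame({
    "shop_id": [2, 3],
    "items_sold": [75, 33],
    "total_revenue": [103746.0, 67443.0],
})

connection = sqlite3.connect(":memory:")

# if_exists="append" creates the table on the first run and adds rows afterwards,
# which suits a process that integrates new data every day
shops_by_quantity_revenue.to_sql(
    "shop_aggregates", connection, if_exists="append", index=False
)
shops_by_quantity_revenue.to_sql(
    "shop_aggregates", connection, if_exists="append", index=False
)

stored = pd.read_sql("SELECT * FROM shop_aggregates", connection)
connection.close()
print(stored)
```

The same pattern would apply to the cleaned data and to the per-item aggregate. Appending the same day twice duplicates its rows, as the second call shows, so a real process would need a guard against re-runs.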
