# Mandatory Challenge
## Context
You work in the data analysis team of a very important company. On Monday, the company shares some good news with you: you just got hired by a major retail company! So, let's get prepared for a huge amount of work!

Then you get to work with your team and define the following tasks to perform:   
1. You need to start your analysis using data from the past.  
2. You need to define a process that takes your daily data as an input and integrates it.  

You are in charge of the second part, so you are provided with a sample file that you will have to read daily. To complete your task, you need the following aggregates:
* One aggregate per store that adds up the rest of the values.
* One aggregate per item that adds up the rest of the values.

You can import the `raw_sales` table from the `retail_sales` database, one of Ironhack's databases.

## Your task
Therefore, your process will consist of the following steps:
1. Read the sample file that a daily process will save in your folder. 
2. Clean up the data.
3. Create the aggregates.
4. Write three tables in your local database: 
    - A table for the cleaned data.
    - A table for the aggregate per store.
    - A table for the aggregate per item.

## Instructions
* Clean the data and create the aggregates as you consider.
* Create the tables in your local database.
* Populate them with your process.

In [82]:
import pandas as pd
import numpy as np

### Part 1. Analyse / Import Data


In [104]:
df = pd.read_csv('raw_sales.csv', sep = ";")
df.head()

,date,shop_id,item_id,item_price,item_cnt_day
0,2015-01-04 00:00:00,29,1469,1199.0,1.0
1,2015-01-04 00:00:00,28,21364,479.0,1.0
2,2015-01-04 00:00:00,28,21365,999.0,2.0
3,2015-01-04 00:00:00,28,22104,249.0,2.0
4,2015-01-04 00:00:00,28,22091,179.0,1.0


In [84]:
df.tail()

,date,shop_id,item_id,item_price,item_cnt_day
4540,2015-01-04 00:00:00,15,4240,1299.0,1.0
4541,2015-01-04 00:00:00,14,21922,99.0,1.0
4542,2015-01-04 00:00:00,15,1969,3999.0,1.0
4543,2015-01-04 00:00:00,14,22091,179.0,1.0
4544,2015-01-04 00:00:00,15,1007,1199.0,1.0


In [85]:
df.columns

Index(['date', 'shop_id', 'item_id', 'item_price', 'item_cnt_day'], dtype='object')

In [86]:
# Overview of the data set. 
df.describe()

,shop_id,item_id,item_price,item_cnt_day
count,4545.0,4545.0,4545.0,4545.0
mean,34.021122,11140.459406,1031.686121,1.10363
std,16.565517,6558.649572,2073.91999,0.536967
min,2.0,30.0,3.0,-1.0
25%,22.0,4977.0,249.0,1.0
50%,31.0,11247.0,479.0,1.0
75%,50.0,16671.0,1192.0,1.0
max,59.0,22162.0,27990.0,10.0


In [87]:
# Information about the data types contained in the dataset.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4545 entries, 0 to 4544
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   date          4545 non-null   object 
 1   shop_id       4545 non-null   int64  
 2   item_id       4545 non-null   int64  
 3   item_price    4545 non-null   float64
 4   item_cnt_day  4545 non-null   float64
dtypes: float64(2), int64(2), object(1)
memory usage: 177.7+ KB


### Part 2. Clean Data

In [88]:
# Replace negative item counts with zero.
# Using .loc on the DataFrame itself avoids chained assignment and the SettingWithCopyWarning it triggers.
df.loc[df['item_cnt_day'] < 0, 'item_cnt_day'] = 0
df.describe()


,shop_id,item_id,item_price,item_cnt_day
count,4545.0,4545.0,4545.0,4545.0
mean,34.021122,11140.459406,1031.686121,1.110231
std,16.565517,6558.649572,2073.91999,0.516832
min,2.0,30.0,3.0,0.0
25%,22.0,4977.0,249.0,1.0
50%,31.0,11247.0,479.0,1.0
75%,50.0,16671.0,1192.0,1.0
max,59.0,22162.0,27990.0,10.0
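
The cleaning step above sets negative counts to zero. As a minimal sketch of an alternative, `Series.clip(lower=0)` does the same thing in one call, and counting the affected rows first records how many values were changed. The `sample` frame and its values are made up for illustration and are not taken from `raw_sales.csv`.

```python
import pandas as pd

# Hypothetical stand-in for the sales data; values are illustrative only.
sample = pd.DataFrame({"item_id": [1, 2, 3], "item_cnt_day": [1.0, -1.0, 2.0]})

# Count the rows with a negative quantity before changing anything.
n_negative = int((sample["item_cnt_day"] < 0).sum())

# clip(lower=0) raises every negative value to 0 and leaves the rest untouched.
sample["item_cnt_day"] = sample["item_cnt_day"].clip(lower=0)
```

Whether a negative count should become zero or be dropped entirely depends on what it represents in the source system, so this choice is worth confirming with whoever owns the data.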


In [89]:
# Double-check whether there are any NaN (not a number) values in the DataFrame. There don't appear to be any,
# so the dataset is surprisingly clean.
df.isna().sum()

date            0
shop_id         0
item_id         0
item_price      0
item_cnt_day    0
dtype: int64

In [97]:
# Let's convert item_price and item_cnt_day to ints (they never have decimal values) to make the dataset more cohesive.
df = df.astype({'item_price': 'int64','item_cnt_day': 'int64'})
df.dtypes

date            object
shop_id          int64
item_id          int64
item_price       int64
item_cnt_day     int64
dtype: object
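
The cast above relies on the prices and counts never having a fractional part. Since the process is meant to run on a new file every day, a small guard could confirm that before casting. This is a sketch under that assumption; `is_whole` is a hypothetical helper, and `sample` is a made-up frame.

```python
import pandas as pd

# Hypothetical stand-in for the daily file; values are illustrative only.
sample = pd.DataFrame({"item_price": [1199.0, 479.0], "item_cnt_day": [1.0, 2.0]})

def is_whole(series):
    """Return True when every value in the series equals its rounded self."""
    return bool((series == series.round()).all())

cols = ["item_price", "item_cnt_day"]

# Cast only when no information would be lost by dropping the decimals.
if all(is_whole(sample[c]) for c in cols):
    sample = sample.astype({c: "int64" for c in cols})
```

If a future file contains a price such as `99.5`, the check fails and the columns stay as floats instead of being silently truncated.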

In [99]:
# Convert object to datetime
df['date'] = pd.to_datetime(df['date'])
df.dtypes

date            datetime64[ns]
shop_id                  int64
item_id                  int64
item_price               int64
item_cnt_day             int64
dtype: object
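
The dates in the sample file look like `2015-01-04 00:00:00`. Assuming every daily file uses that layout, passing an explicit `format` and `errors="coerce"` to `pd.to_datetime` turns anything malformed into `NaT`, which can then be counted. The `raw_dates` series below is made up, with one deliberately bad value.

```python
import pandas as pd

# Hypothetical input: one valid timestamp and one malformed entry.
raw_dates = pd.Series(["2015-01-04 00:00:00", "not a date"])

# Unparseable entries become NaT instead of raising an exception.
parsed = pd.to_datetime(raw_dates, format="%Y-%m-%d %H:%M:%S", errors="coerce")

# Number of rows that failed to parse.
n_bad = int(parsed.isna().sum())
```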

### Part 3. Aggregate Data

- One aggregate per store that adds up the rest of the values.



In [100]:
per_store = df.groupby('shop_id').agg({'item_price':['sum','mean','min','max','count']})
per_store.head()

,item_price,item_price,item_price,item_price,item_price
shop_id,sum,mean,min,max,count
2,99069,1320.92,28,8999,75
3,67443,2043.727273,500,8999,33
4,29361,752.846154,79,2799,39
5,33138,736.4,99,3690,45
6,116349,923.404762,5,3999,126


- One aggregate per item that adds up the rest of the values.

In [101]:
per_item = df.groupby('item_id').agg({'item_price':['sum','mean','min','max','count'],
                                       'item_cnt_day':['sum']})
per_item.head()

,item_price,item_price,item_price,item_price,item_price,item_cnt_day
item_id,sum,mean,min,max,count,sum
30,507,169.0,169,169,3,3
31,1089,363.0,363,363,3,3
32,447,149.0,149,149,3,3
42,897,299.0,299,299,3,3
59,747,249.0,249,249,3,3
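
Both aggregates above produce two-level column headers, which are awkward to store as table columns. As a sketch of one alternative, pandas named aggregation gives each result a single flat column name. The output names (`price_sum`, `price_mean`, `items_sold`) are assumptions chosen for illustration; the three rows are copied from the `df.head()` output.

```python
import pandas as pd

# Three rows taken from the head of the sample data.
sample = pd.DataFrame({
    "shop_id": [29, 28, 28],
    "item_id": [1469, 21364, 21365],
    "item_price": [1199, 479, 999],
    "item_cnt_day": [1, 1, 2],
})

# Named aggregation: output_name=(input_column, function).
per_store_flat = (
    sample.groupby("shop_id")
    .agg(
        price_sum=("item_price", "sum"),
        price_mean=("item_price", "mean"),
        items_sold=("item_cnt_day", "sum"),
    )
    .reset_index()  # turn shop_id back into an ordinary column
)
```

The same pattern applies to the per-item aggregate by grouping on `item_id` instead.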


### Part 4. New Tables

Write three tables in your local database:
- A table for the cleaned data.
- A table for the aggregate per store.
- A table for the aggregate per item.

In [108]:
# clean data
df.to_csv(r'C:\Users\Gareth\Desktop\Ironhack\Week_Three\lab-df-calculation-and-transformation\your-code\tables\data_clean.csv')

# agg per store
per_store.to_csv(r'C:\Users\Gareth\Desktop\Ironhack\Week_Three\lab-df-calculation-and-transformation\your-code\tables\data_per_store.csv')

# agg per item
per_item.to_csv(r'C:\Users\Gareth\Desktop\Ironhack\Week_Three\lab-df-calculation-and-transformation\your-code\tables\data_per_item.csv')
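
The cell above saves CSV files, while the task asks for tables in a local database. A minimal sketch of that step uses `DataFrame.to_sql` with an in-memory SQLite connection from the standard library. The choice of SQLite, the table names, and the small frames are assumptions for illustration; a MySQL or PostgreSQL target would need a SQLAlchemy engine in place of `conn`.

```python
import sqlite3
import pandas as pd

# Hypothetical cleaned data and aggregates; values are illustrative only.
clean = pd.DataFrame({
    "shop_id": [28, 29],
    "item_id": [21364, 1469],
    "item_price": [479, 1199],
    "item_cnt_day": [1, 1],
})
per_store_tbl = clean.groupby("shop_id", as_index=False).agg(price_sum=("item_price", "sum"))
per_item_tbl = clean.groupby("item_id", as_index=False).agg(price_sum=("item_price", "sum"))

# In-memory database so the sketch leaves no file behind.
conn = sqlite3.connect(":memory:")

tables = {
    "data_clean": clean,
    "data_per_store": per_store_tbl,
    "data_per_item": per_item_tbl,
}
for name, table in tables.items():
    # if_exists="replace" recreates the table on every run.
    table.to_sql(name, conn, if_exists="replace", index=False)

# List the tables that now exist in the database.
stored = pd.read_sql("SELECT name FROM sqlite_master WHERE type='table' ORDER BY name", conn)
conn.close()
```

For a process that runs daily, `if_exists="append"` on the cleaned-data table may be the better fit, so that each day's rows accumulate instead of overwriting the previous load.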