# Mandatory Challenge
## Context
You work in the data analysis team of a very important company. On Monday, the company shares some good news with you: a major retail company has just hired you! So get ready for a huge amount of work!

Then you get to work with your team and define the following tasks to perform:   
1. You need to start your analysis using data from the past.  
2. You need to define a process that takes your daily data as an input and integrates it.  

You are in charge of the second part, so you are provided with a sample file that you will have to read daily. To complete your task, you need the following aggregates:
* One aggregate per store that adds up the rest of the values.
* One aggregate per item that adds up the rest of the values.

You can import the dataset `retail_sales` from Ironhack's database. 

## Your task
Therefore, your process will consist of the following steps:
1. Read the sample file that a daily process will save in your folder. 
2. Clean up the data.
3. Create the aggregates.
4. Write three tables in your local database: 
    - A table for the cleaned data.
    - A table for the aggregate per store.
    - A table for the aggregate per item.

## Instructions
* Read the CSV you can find in Ironhack's database.
* Clean the data and create the aggregates as you see fit.
* Create the tables in your local database.
* Populate them with your process.

#### Libraries

In [152]:
import sqlalchemy
import pandas as pd
import pymysql
import re

#### mysql engine to set the connection to the server

In [153]:
conn_string = 'mysql+pymysql://data-students:iR0nH@cK-D4T4B4S3@34.65.10.136/retail_sales'

conn = sqlalchemy.create_engine(conn_string)

In [154]:
conn

Engine(mysql+pymysql://data-students:***@34.65.10.136/retail_sales)

In [155]:
#sales = pd.read_sql_query('SELECT * FROM retail_sales.raw_sales;', conn)

In [156]:
#A table for the cleaned data.
#A table for the aggregate per store.
#A table for the aggregate per item.

In [166]:
df = pd.read_csv('/Users/andressalomferrer/Desktop/raw_sales.csv')

### Cleaning the raw data

In [167]:
# First I want to see what my data looks like:
# (only the first rows are shown so the output does not take up much space)
df.head(2)

Unnamed: 0,date,shop_id,item_id,item_price,item_cnt_day
0,2015-01-04 00:00:00,29,1469,1199.0,1
1,2015-01-04 00:00:00,28,21364,479.0,1


In [170]:
#Now I want to see the column types and the shape of my DataFrame
df.dtypes

date             object
shop_id           int64
item_id           int64
item_price      float64
item_cnt_day      int64
dtype: object

In [171]:
df.shape

(4545, 5)

In [91]:
#The first thing I want to clean is the time of the purchase. I don't want the exact hour to appear in my df; the year, month and day are enough.
#'YYYY-MM-DD' is 10 characters long, so slice(0, 10) keeps the date without the trailing space.
df['date'] = df.date.str.slice(0, 10)

In [94]:
df[0:3]

Unnamed: 0,date,shop_id,item_id,item_price,item_cnt_day
0,2015-01-04,29,1469,1199.0,1
1,2015-01-04,28,21364,479.0,1
2,2015-01-04,28,21365,999.0,2
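
As an alternative to string slicing, the date could be parsed and re-formatted. This is a minimal sketch, assuming every value in `date` looks like `2015-01-04 00:00:00`; the `sample` frame is a hypothetical stand-in built from the rows shown above, not the real file.

```python
import pandas as pd

# Hypothetical stand-in for raw_sales.csv, built from rows shown in this notebook.
sample = pd.DataFrame({
    "date": ["2015-01-04 00:00:00", "2015-01-04 00:00:00"],
    "shop_id": [29, 28],
    "item_id": [1469, 21364],
    "item_price": [1199.0, 479.0],
    "item_cnt_day": [1, 1],
})

# Parse the strings into timestamps, then keep only the calendar date as text.
sample["date"] = pd.to_datetime(sample["date"]).dt.strftime("%Y-%m-%d")
print(sample["date"].tolist())
```

Parsing first has the side effect of raising an error on malformed dates, which plain slicing would silently let through.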


In [97]:
#Now I want to check whether there are any null values. As we can see, there are none, so we can keep moving.
df.isnull().sum()

date            0
shop_id         0
item_id         0
item_price      0
item_cnt_day    0
dtype: int64
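
Besides nulls, a daily file could also be checked for duplicated rows and implausible values. The sketch below is only a suggestion: `basic_checks` is a hypothetical helper, the sample rows are made up for illustration, and whether a negative `item_cnt_day` is an error or a return is not stated in the source data.

```python
import pandas as pd

def basic_checks(frame: pd.DataFrame) -> dict:
    """Count rows that may need attention (hypothetical helper, not part of the notebook)."""
    return {
        "duplicate_rows": int(frame.duplicated().sum()),
        "non_positive_prices": int((frame["item_price"] <= 0).sum()),
        "negative_counts": int((frame["item_cnt_day"] < 0).sum()),
    }

# Made-up sample: the last two rows are identical on purpose.
sample = pd.DataFrame({
    "shop_id": [29, 28, 28],
    "item_id": [1469, 21364, 21364],
    "item_price": [1199.0, 479.0, 479.0],
    "item_cnt_day": [1, 1, 1],
})

report = basic_checks(sample)
print(report)
```

Each count only flags rows for review; what to do with them (drop, keep, correct) remains a judgment call.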

In [101]:
#From my point of view, the data is clean, so I won't make any more changes.
df[0:3]

Unnamed: 0,date,shop_id,item_id,item_price,item_cnt_day
0,2015-01-04,29,1469,1199.0,1
1,2015-01-04,28,21364,479.0,1
2,2015-01-04,28,21365,999.0,2


### Aggregate per store

In [120]:
#Add a new column to the original df with the revenue of each sale.
#Revenue = price per item * quantity sold

df['Revenue_store'] = df['item_price']*df['item_cnt_day']

In [137]:
df[0:3]

Unnamed: 0,date,shop_id,item_id,item_price,item_cnt_day,Revenue_store
0,2015-01-04,29,1469,1199.0,1,1199.0
1,2015-01-04,28,21364,479.0,1,479.0
2,2015-01-04,28,21365,999.0,2,1998.0


In [138]:
#Now we want to create a new DataFrame with only the columns needed for the revenue per store. We still need to group by store to get the final revenue per store.
new_df = df.loc[:,['shop_id','item_cnt_day','Revenue_store']]

In [141]:
new_df[0:3]

Unnamed: 0,shop_id,item_cnt_day,Revenue_store
0,29,1,1199.0
1,28,1,479.0
2,28,2,1998.0


In [185]:
new_df.groupby(['shop_id']).sum().sort_values('Revenue_store', ascending = False)[0:5]

shop_id,item_cnt_day,Revenue_store
42,249,330111.0
31,402,304692.0
12,216,295173.0
25,312,288432.0
21,180,228999.0
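
The same per-store aggregate can be written in one chain with named aggregation, which keeps `shop_id` as a regular column and makes the result easier to save as a table. This is a sketch under assumptions: the column names `items_sold` and `revenue` are hypothetical, and the sample holds only the three rows shown earlier.

```python
import pandas as pd

# Three rows taken from the notebook output above.
sample = pd.DataFrame({
    "shop_id": [29, 28, 28],
    "item_price": [1199.0, 479.0, 999.0],
    "item_cnt_day": [1, 1, 2],
})
sample["revenue"] = sample["item_price"] * sample["item_cnt_day"]

# One row per store: total units sold and total revenue, highest revenue first.
per_store = (
    sample.groupby("shop_id", as_index=False)
    .agg(items_sold=("item_cnt_day", "sum"), revenue=("revenue", "sum"))
    .sort_values("revenue", ascending=False)
    .reset_index(drop=True)
)
print(per_store)
```

Swapping `shop_id` for `item_id` in the `groupby` call would give the per-item aggregate in the same way.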


### Aggregate per item

In [178]:
df['Revenue_item'] = df['item_price']*df['item_cnt_day']

In [179]:
df[0:3]

Unnamed: 0,date,shop_id,item_id,item_price,item_cnt_day,Revenue_item
0,2015-01-04 00:00:00,29,1469,1199.0,1,1199.0
1,2015-01-04 00:00:00,28,21364,479.0,1,479.0
2,2015-01-04 00:00:00,28,21365,999.0,2,1998.0


In [180]:
df_item = df.loc[:,['item_id','item_cnt_day','Revenue_item']]

In [186]:
df_item.groupby(['item_id']).sum().sort_values('Revenue_item', ascending = False)[0:5]

item_id,item_cnt_day,Revenue_item
1969,66,262134.0
6675,9,242910.0
1971,27,121473.0
1970,12,107988.0
13494,6,89940.0
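
The task also asks for three tables to be written to a local database. A minimal sketch of that step is shown below, with several assumptions: an in-memory SQLite database stands in for the local MySQL server, the table names `clean_sales`, `sales_per_store` and `sales_per_item` are hypothetical, and `write_tables` is an illustrative helper rather than part of the notebook. `DataFrame.to_sql` also accepts a SQLAlchemy engine, so the same calls should work with a local `create_engine(...)` connection.

```python
import sqlite3
import pandas as pd

def write_tables(clean: pd.DataFrame, conn) -> None:
    """Save the cleaned data and both aggregates (hypothetical table names)."""
    with_revenue = clean.assign(revenue=clean["item_price"] * clean["item_cnt_day"])
    per_store = with_revenue.groupby("shop_id", as_index=False)[["item_cnt_day", "revenue"]].sum()
    per_item = with_revenue.groupby("item_id", as_index=False)[["item_cnt_day", "revenue"]].sum()

    # if_exists="replace" overwrites any previous table with the same name.
    clean.to_sql("clean_sales", conn, if_exists="replace", index=False)
    per_store.to_sql("sales_per_store", conn, if_exists="replace", index=False)
    per_item.to_sql("sales_per_item", conn, if_exists="replace", index=False)

# In-memory SQLite is an assumption made so the example runs anywhere.
conn = sqlite3.connect(":memory:")
sample = pd.DataFrame({
    "date": ["2015-01-04", "2015-01-04", "2015-01-04"],
    "shop_id": [29, 28, 28],
    "item_id": [1469, 21364, 21365],
    "item_price": [1199.0, 479.0, 999.0],
    "item_cnt_day": [1, 1, 2],
})
write_tables(sample, conn)

stored = pd.read_sql_query("SELECT * FROM sales_per_store ORDER BY shop_id", conn)
print(stored)
```

For a process that runs daily, `if_exists="append"` on the cleaned table may be the better fit, with the aggregates recomputed from the full table afterwards; which one applies depends on how the daily files overlap.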
