# Mandatory Challenge
## Context
You work in the data analysis team of a very important company. On Monday, the company shares some good news with you: you have just been hired by a major retail company! So get ready for a huge amount of work!

Then you get to work with your team and define the following tasks to perform:   
1. You need to start your analysis using data from the past.  
2. You need to define a process that takes your daily data as an input and integrates it.  

You are in charge of the second part, so you are provided with a sample file that you will have to read daily. To complete your task, you need the following aggregates:
* One aggregate per store that sums the remaining numeric values.
* One aggregate per item that sums the remaining numeric values.
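
The two aggregates can be sketched with `groupby`. This is a minimal illustration on made-up rows, assuming the column names `shop_id`, `item_id`, `item_price` and `item_cnt_day`; the `revenue` column is a hypothetical helper, not part of the dataset.

```python
import pandas as pd

# Hypothetical sample rows using the assumed column names.
sales = pd.DataFrame({
    "shop_id": [28, 28, 29],
    "item_id": [21364, 21365, 21364],
    "item_price": [479.0, 999.0, 479.0],
    "item_cnt_day": [1, 2, 3],
})

# Revenue per row: unit price times units sold (hypothetical helper column).
sales["revenue"] = sales["item_price"] * sales["item_cnt_day"]

# One row per store: total units sold and total revenue.
per_store = sales.groupby("shop_id", as_index=False)[["item_cnt_day", "revenue"]].sum()

# One row per item: total units sold and total revenue.
per_item = sales.groupby("item_id", as_index=False)[["item_cnt_day", "revenue"]].sum()

print(per_store)
print(per_item)
```

Summing `item_price` directly would add up unit prices rather than money earned, which is why the sketch multiplies price by quantity first.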

You can import the dataset `retail_sales` from Ironhack's database. 

## Your task
Therefore, your process will consist of the following steps:
1. Read the sample file that a daily process will save in your folder. 
2. Clean up the data.
3. Create the aggregates.
4. Write three tables in your local database: 
    - A table for the cleaned data.
    - A table for the aggregate per store.
    - A table for the aggregate per item.

## Instructions
* Read the CSV data you can find in Ironhack's database.
* Clean the data and create the aggregates as you see fit.
* Create the tables in your local database.
* Populate them with your process.
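
Creating and populating the three tables could look roughly like the sketch below. It assumes an in-memory SQLite database as a stand-in for the local database, and the table names `sales_cleaned`, `sales_per_store` and `sales_per_item` are hypothetical. With a MySQL target, a SQLAlchemy engine would take the place of `conn`.

```python
import sqlite3

import pandas as pd

# Hypothetical cleaned data; the real frame comes out of the cleaning step.
cleaned = pd.DataFrame({
    "shop_id": [28, 28, 29],
    "item_id": [21364, 21365, 1469],
    "item_price": [479, 999, 1199],
    "item_cnt_day": [1, 2, 1],
})
per_store = cleaned.groupby("shop_id", as_index=False)["item_cnt_day"].sum()
per_item = cleaned.groupby("item_id", as_index=False)["item_cnt_day"].sum()

# In-memory SQLite stands in for the local database (assumption).
conn = sqlite3.connect(":memory:")

tables = {
    "sales_cleaned": cleaned,
    "sales_per_store": per_store,
    "sales_per_item": per_item,
}
for name, frame in tables.items():
    # if_exists="replace" recreates the table on every run;
    # "append" would accumulate the daily loads instead.
    frame.to_sql(name, conn, if_exists="replace", index=False)

# Read back the table names and one row count to confirm the write.
stored = pd.read_sql(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name", conn
)
row_count = pd.read_sql("SELECT COUNT(*) AS n FROM sales_per_store", conn)["n"].iloc[0]
print(stored)
```

Whether to replace or append depends on whether each daily file should overwrite the previous load or extend it; the task description leaves that open.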

In [128]:
# Imports used throughout the notebook
import pandas as pd
from sqlalchemy import create_engine

In [129]:
# driver, user, password, ip and database are defined in cell In [140] further down
connection_string = f'{driver}//{user}:{password}@{ip}/{database}'
print(connection_string)

mysql+pymysql://data-students:iR0nH@cK-D4T4B4S3@34.65.10.136/retail_sales
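
The password in this connection string contains an `@`, which is also the character that separates the credentials from the host. It worked here, but a more defensive variant percent-encodes the password first. The sketch below uses only the standard library; the user, password and host are placeholders, not the real credentials.

```python
from urllib.parse import quote_plus

driver = "mysql+pymysql:"
user = "example-user"        # placeholder
password = "p@ss/word"       # placeholder containing URL-special characters
ip = "127.0.0.1"             # placeholder
database = "retail_sales"

# Percent-encode the password so '@' and '/' cannot be mistaken for URL delimiters.
safe_password = quote_plus(password)
connection_string = f"{driver}//{user}:{safe_password}@{ip}/{database}"
print(connection_string)
# mysql+pymysql://example-user:p%40ss%2Fword@127.0.0.1/retail_sales
```

Printing a connection string also prints the password, so in a shared notebook it may be safer to read the credentials from environment variables.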


In [130]:
engine = create_engine(connection_string)
print(engine)

Engine(mysql+pymysql://data-students:***@34.65.10.136/retail_sales)


In [131]:
data=pd.read_sql('SHOW TABLES;', engine)
print(data)

  Tables_in_retail_sales
0              raw_sales


In [132]:
table_raw = pd.read_sql('SELECT * FROM raw_sales', engine)
original=table_raw.copy()



In [133]:
table_raw

,date,shop_id,item_id,item_price,item_cnt_day
0,2015-01-04,29,1469,1199.0,1.0
1,2015-01-04,28,21364,479.0,1.0
2,2015-01-04,28,21365,999.0,2.0
3,2015-01-04,28,22104,249.0,2.0
4,2015-01-04,28,22091,179.0,1.0
5,2015-01-04,28,21842,149.0,1.0
6,2015-01-04,28,21881,299.0,1.0
7,2015-01-04,29,6930,2199.0,1.0
8,2015-01-04,29,10515,169.0,1.0
9,2015-01-04,29,8624,149.0,1.0


In [134]:
table_raw.dtypes

date            datetime64[ns]
shop_id                  int64
item_id                  int64
item_price             float64
item_cnt_day           float64
dtype: object

In [135]:
# item_cnt_day is a count, so float is an odd dtype for it
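
Before casting a float column to an integer type, it can help to confirm that no fractional values would be silently truncated. A minimal sketch on a made-up series:

```python
import pandas as pd

# Hypothetical counts stored as floats, as in item_cnt_day.
counts = pd.Series([1.0, 2.0, -1.0, 10.0], name="item_cnt_day")

# True only if every value is a whole number (no fractional part).
is_whole = (counts % 1 == 0).all()

if is_whole:
    # Safe to cast: no information is lost.
    counts = counts.astype("int64")

print(is_whole, counts.dtype)
```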

In [136]:
table_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4545 entries, 0 to 4544
Data columns (total 5 columns):
date            4545 non-null datetime64[ns]
shop_id         4545 non-null int64
item_id         4545 non-null int64
item_price      4545 non-null float64
item_cnt_day    4545 non-null float64
dtypes: datetime64[ns](1), float64(2), int64(2)
memory usage: 177.6 KB


In [137]:
# No NaN values: every column has 4545 non-null entries

In [138]:
from sqlalchemy import create_engine

In [139]:
table_raw.groupby('item_id').count()
# Each item appears in several rows, so it makes sense to group by item_id later

item_id,date,shop_id,item_price,item_cnt_day
30,3,3,3,3
31,3,3,3,3
32,3,3,3,3
42,3,3,3,3
59,3,3,3,3
74,3,3,3,3
109,3,3,3,3
259,3,3,3,3
464,3,3,3,3
482,6,6,6,6


In [140]:
driver   = 'mysql+pymysql:'
user     = 'data-students'
password = 'iR0nH@cK-D4T4B4S3'
ip       = '34.65.10.136'
database = 'retail_sales'

In [141]:
table_raw.groupby('item_cnt_day').count()
# item_cnt_day = -1 looks wrong, so I'll replace it with 0


item_cnt_day,date,shop_id,item_id,item_price
-1.0,30,30,30,30
1.0,4161,4161,4161,4161
2.0,264,264,264,264
3.0,42,42,42,42
4.0,30,30,30,30
5.0,9,9,9,9
6.0,6,6,6,6
10.0,3,3,3,3


In [142]:
table_raw.info()



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4545 entries, 0 to 4544
Data columns (total 5 columns):
date            4545 non-null datetime64[ns]
shop_id         4545 non-null int64
item_id         4545 non-null int64
item_price      4545 non-null float64
item_cnt_day    4545 non-null float64
dtypes: datetime64[ns](1), float64(2), int64(2)
memory usage: 177.6 KB


In [143]:
# First cast item_cnt_day from float to int
table_raw=table_raw.astype({'item_cnt_day':'int64'})
table_raw.dtypes

date            datetime64[ns]
shop_id                  int64
item_id                  int64
item_price             float64
item_cnt_day             int64
dtype: object

In [144]:
table_raw.groupby('item_cnt_day').count()


item_cnt_day,date,shop_id,item_id,item_price
-1,30,30,30,30
1,4161,4161,4161,4161
2,264,264,264,264
3,42,42,42,42
4,30,30,30,30
5,9,9,9,9
6,6,6,6,6
10,3,3,3,3


In [145]:
# Replace -1 with 0; assigning the result back avoids in-place replacement on a selected column
table_raw['item_cnt_day'] = table_raw['item_cnt_day'].replace(-1, 0)
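
Replacing only `-1` would miss other negative counts if they appear in a later daily file. Two alternatives are sketched below on made-up rows: clipping every negative value to zero, or dropping the rows. Which one is right depends on what a negative count means (a return, for example, or a data-entry error), and the source data does not say.

```python
import pandas as pd

# Hypothetical rows with one negative count.
sales = pd.DataFrame({"item_id": [30, 31, 32], "item_cnt_day": [1, -1, 2]})

# Option 1: set every negative count to 0 (generalises replace(-1, 0)).
clipped = sales.assign(item_cnt_day=sales["item_cnt_day"].clip(lower=0))

# Option 2: drop rows with a negative count entirely.
dropped = sales[sales["item_cnt_day"] >= 0]

print(clipped)
print(dropped)
```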

In [146]:
table_raw.groupby('item_cnt_day').count()


item_cnt_day,date,shop_id,item_id,item_price
0,30,30,30,30
1,4161,4161,4161,4161
2,264,264,264,264
3,42,42,42,42
4,30,30,30,30
5,9,9,9,9
6,6,6,6,6
10,3,3,3,3


In [157]:
grouped_price=table_raw.groupby('item_price').count()
grouped_price=grouped_price.sort_values(by='item_price', ascending=False)
grouped_price

# We should get rid of the few highest values



item_price,date,shop_id,item_id,item_cnt_day
27990.0,3,3,3,3
27392.0,3,3,3,3
26990.0,9,9,9,9
25392.0,3,3,3,3
19990.0,3,3,3,3
14990.0,6,6,6,6
8999.0,9,9,9,9
6990.0,3,3,3,3
6799.0,3,3,3,3
5890.0,3,3,3,3


In [180]:
table_raw.item_price.quantile(.99)
# 99% of prices are at or below 5500, so the few prices close to 30,000 are outliers we should remove

5500.0
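
A quantile can also serve directly as the cut-off instead of a hand-picked constant. The sketch below uses made-up prices and an arbitrary 80th-percentile threshold purely for illustration; the right quantile for the real data is a judgment call.

```python
import pandas as pd

# Hypothetical prices with two extreme values at the top.
prices = pd.Series(
    [149, 179, 249, 299, 479, 999, 1199, 2199, 26990, 27990], name="item_price"
)

# Use a quantile as the cut-off instead of a hard-coded number.
threshold = prices.quantile(0.8)

# Keep only the rows at or below the threshold.
kept = prices[prices <= threshold]

print(threshold)
print(kept.max())
```

Compared with a fixed limit, a quantile-based threshold adapts to each daily file, at the cost of always removing some rows even when no real outliers are present.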

In [182]:
table_raw=table_raw.astype({'item_price':'int64'})
table_raw.dtypes

date            datetime64[ns]
shop_id                  int64
item_id                  int64
item_price               int64
item_cnt_day             int64
dtype: object

In [209]:
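# Keep prices below 25,000; .copy() gives an independent frame so new columns can be added safely
cleaned=table_raw[table_raw.item_price<25000].copy()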
cleaned.head()


,date,shop_id,item_id,item_price,item_cnt_day
0,2015-01-04,29,1469,1199,1
1,2015-01-04,28,21364,479,1
2,2015-01-04,28,21365,999,2
3,2015-01-04,28,22104,249,2
4,2015-01-04,28,22091,179,1


In [213]:
cleaned['shop_earnings']=cleaned.item_price*cleaned.item_cnt_day
cleaned.head()

,date,shop_id,item_id,item_price,item_cnt_day,shop_earnings
0,2015-01-04,29,1469,1199,1,1199
1,2015-01-04,28,21364,479,1,479
2,2015-01-04,28,21365,999,2,1998
3,2015-01-04,28,22104,249,2,498
4,2015-01-04,28,22091,179,1,179


In [223]:
shop_aggregated=cleaned.groupby(['date','shop_id']).agg({'item_cnt_day':'sum', 'shop_earnings':'sum'})

In [224]:
shop_aggregated

date,shop_id,item_cnt_day,shop_earnings
2015-01-04,2,81,103743
2015-01-04,3,33,67443
2015-01-04,4,39,29361
2015-01-04,5,45,33138
2015-01-04,6,150,138675
2015-01-04,7,63,52371
2015-01-04,10,30,22716
2015-01-04,12,216,295158
2015-01-04,14,51,57450
2015-01-04,15,93,125139



In [226]:
item_aggregated=cleaned.groupby(['date','item_id']).agg({'item_cnt_day':'sum', 'shop_earnings':'sum'})

In [227]:
item_aggregated

date,item_id,item_cnt_day,shop_earnings
2015-01-04,30,3,507
2015-01-04,31,3,1089
2015-01-04,32,3,447
2015-01-04,42,3,897
2015-01-04,59,3,747
2015-01-04,74,3,1497
2015-01-04,109,3,747
2015-01-04,259,3,747
2015-01-04,464,3,897
2015-01-04,482,12,39600
