## Context
You work in the data analysis team of a very important company. On Monday, the company shares some good news with you: you just got hired by a major retail company! So, let's get prepared for a huge amount of work!

Then you get to work with your team and define the following tasks to perform:   
1. You need to start your analysis using data from the past.  
2. You need to define a process that takes your daily data as an input and integrates it.  

You are in charge of the second part, so you are provided with a sample file that you will have to read daily. To complete your task, you need the following aggregates:
* One aggregate per store that adds up the rest of the values.
* One aggregate per item that adds up the rest of the values.

You can import the `raw_sales` table from the `retail_sales` database, one of Ironhack's databases.

## Your task
Therefore, your process will consist of the following steps:
1. Read the sample file that a daily process will save in your folder. 
2. Clean up the data.
3. Create the aggregates.
4. Write three tables in your local database: 
    - A table for the cleaned data.
    - A table for the aggregate per store.
    - A table for the aggregate per item.

## Instructions
* Clean the data and create the aggregates as you see fit.
* Create the tables in your local database.
* Populate them with your process.
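
The steps listed above can be sketched end to end as follows. This is only a minimal sketch under assumptions: an in-memory SQLite database (standard-library `sqlite3`) stands in for the local database, a small inline sample stands in for the daily file, and the names `run_daily_process`, `sales_clean`, `sales_per_store` and `sales_per_item` are hypothetical. Dropping exact duplicate rows is also an assumption about what "cleaning" should mean here.

```python
import io
import sqlite3

import pandas as pd

# Tiny inline sample standing in for the daily file (same columns, ";" separated).
SAMPLE = (
    "date;shop_id;item_id;item_price;item_cnt_day\n"
    "2015-01-04 00:00:00;29;1469;1199.0;1.0\n"
    "2015-01-04 00:00:00;28;21364;479.0;1.0\n"
    "2015-01-04 00:00:00;28;21365;999.0;2.0\n"
    "2015-01-04 00:00:00;28;21364;479.0;1.0\n"
)


def run_daily_process(source, conn):
    """Read, clean, aggregate and store one daily file (hypothetical helper)."""
    # 1. Read the semicolon-separated file.
    raw = pd.read_csv(source, sep=";")

    # 2. Clean: drop incomplete rows and exact duplicates (assumed to be
    #    unwanted repeats), then keep only the date part as text.
    clean = raw.dropna().drop_duplicates().copy()
    clean["date"] = pd.to_datetime(clean["date"]).dt.strftime("%Y-%m-%d")

    # 3. Aggregates: sum the remaining numeric values per store and per item.
    values = ["item_price", "item_cnt_day"]
    per_store = clean.groupby("shop_id")[values].sum().reset_index()
    per_item = clean.groupby("item_id")[values].sum().reset_index()

    # 4. Write the three tables, replacing any previous version.
    clean.to_sql("sales_clean", conn, if_exists="replace", index=False)
    per_store.to_sql("sales_per_store", conn, if_exists="replace", index=False)
    per_item.to_sql("sales_per_item", conn, if_exists="replace", index=False)
    return clean, per_store, per_item


conn = sqlite3.connect(":memory:")
clean, per_store, per_item = run_daily_process(io.StringIO(SAMPLE), conn)
print(pd.read_sql("SELECT * FROM sales_per_store", conn))
```

In a real run, `source` would be the path of the daily file and `conn` a connection to the actual local database; `if_exists="replace"` overwrites each table, so `"append"` may be the better choice if the daily data has to accumulate.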

# Exercise

In [1]:
import pandas as pd
import numpy as np


## Read the sample file that a daily process will save in your folder.  

In [2]:
retail_sales_raw = pd.read_csv(filepath_or_buffer=("../data/retail_sales-raw_sales.csv"),sep=";")
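
The file uses `;` as its separator, so reading it with the default comma would put every row into a single column. A small sketch with an inline sample (an in-memory buffer stands in for the file, and the names `sample`, `wrong` and `right` are illustrative) shows the difference:

```python
import io

import pandas as pd

sample = (
    "date;shop_id;item_id;item_price;item_cnt_day\n"
    "2015-01-04 00:00:00;29;1469;1199.0;1.0\n"
    "2015-01-04 00:00:00;28;21364;479.0;1.0\n"
)

# Default separator (","): each whole line ends up in one column.
wrong = pd.read_csv(io.StringIO(sample))
# Correct separator (";"): five separate columns.
right = pd.read_csv(io.StringIO(sample), sep=";")

print(wrong.shape)  # (2, 1)
print(right.shape)  # (2, 5)
```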


## Create a copy that I can go back to if something goes wrong

In [3]:

retail_sales_raw_copy = retail_sales_raw.copy()

In [4]:
# Show the first rows.
# Note: the separator turned out to be ";", so the read_csv line was changed to use sep=";"
retail_sales_raw.head()

Unnamed: 0,date,shop_id,item_id,item_price,item_cnt_day
0,2015-01-04 00:00:00,29,1469,1199.0,1.0
1,2015-01-04 00:00:00,28,21364,479.0,1.0
2,2015-01-04 00:00:00,28,21365,999.0,2.0
3,2015-01-04 00:00:00,28,22104,249.0,2.0
4,2015-01-04 00:00:00,28,22091,179.0,1.0


## Explore the dataframe

In [5]:

retail_sales_raw.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4545 entries, 0 to 4544
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   date          4545 non-null   object 
 1   shop_id       4545 non-null   int64  
 2   item_id       4545 non-null   int64  
 3   item_price    4545 non-null   float64
 4   item_cnt_day  4545 non-null   float64
dtypes: float64(2), int64(2), object(1)
memory usage: 177.7+ KB


In [6]:
#keep exploring:

In [7]:
len(retail_sales_raw)

4545

In [8]:
retail_sales_raw['date'].value_counts()

2015-01-04 00:00:00    4545
Name: date, dtype: int64

In [9]:
# another way to check the number of unique values in the date column
retail_sales_raw['date'].nunique()

1

In [10]:
retail_sales_raw['date'].unique()

array(['2015-01-04 00:00:00'], dtype=object)

 Conclusion 1: There are 4545 rows in total, all with the same date and time ("2015-01-04 00:00:00").
 
 The dates are probably generated by the automatic process; the time is not important, the date is.
 
 The column is not stored as a datetime type (its dtype is `object`).
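
One possible way to act on this conclusion is to parse the column into a datetime type and discard the time of day. A minimal sketch on a two-row stand-in frame (`df` is illustrative, and the format string assumes every value looks like the one shown above):

```python
import pandas as pd

df = pd.DataFrame({"date": ["2015-01-04 00:00:00", "2015-01-04 00:00:00"]})

# Parse the strings into datetime64 values.
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d %H:%M:%S")
# Reset the time of day to midnight so only the date carries information.
df["date"] = df["date"].dt.normalize()

print(df.dtypes)
```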

In [11]:
retail_sales_raw.tail()

Unnamed: 0,date,shop_id,item_id,item_price,item_cnt_day
4540,2015-01-04 00:00:00,15,4240,1299.0,1.0
4541,2015-01-04 00:00:00,14,21922,99.0,1.0
4542,2015-01-04 00:00:00,15,1969,3999.0,1.0
4543,2015-01-04 00:00:00,14,22091,179.0,1.0
4544,2015-01-04 00:00:00,15,1007,1199.0,1.0


In [12]:
# There are many different shops.
# Shops 31, 57 and 25 have the most rows (individual sales records),
# not taking price or quantity into account.
# Shops 10, 51 and 34 have the fewest.
retail_sales_raw['shop_id'].value_counts()

31    345
57    309
25    294
42    240
28    201
54    186
21    174
27    162
58    153
12    144
6     126
26    108
22    108
55    105
59    102
46     96
15     93
18     87
35     84
47     78
2      75
52     75
56     72
53     72
29     72
24     69
38     66
44     66
45     66
19     66
16     63
7      63
48     63
50     60
37     60
39     51
14     45
5      45
4      39
49     33
3      33
41     30
10     27
51     21
34     18
Name: shop_id, dtype: int64
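
Every count in the output above is a multiple of three, which may hint that rows are repeated in the file. That is only a guess from the counts, not something checked in this notebook. A minimal sketch of how one could check, on a made-up frame with similar columns (`df` and its values are illustrative only):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "shop_id": [29, 29, 29, 28],
        "item_id": [1469, 1469, 1469, 21364],
        "item_price": [1199.0, 1199.0, 1199.0, 479.0],
        "item_cnt_day": [1.0, 1.0, 1.0, 1.0],
    }
)

# Number of rows that repeat an earlier row exactly.
n_repeats = int(df.duplicated().sum())
# How many times each distinct row occurs.
occurrences = df.groupby(list(df.columns)).size()
# Keep a single copy of each row.
deduplicated = df.drop_duplicates()

print(n_repeats, len(deduplicated))  # 2 2
```

Whether repeated rows are errors or legitimate separate sales depends on how the daily file is produced, so dropping them is a decision to confirm with whoever owns the source data.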

In [13]:
#There are 45 shops:
retail_sales_raw['shop_id'].nunique()

45

In [14]:
#There are 45 shops, another way to check:
len(retail_sales_raw['shop_id'].value_counts())

45

In [15]:
retail_sales_raw["price_per_sell"]=retail_sales_raw["item_price"]*retail_sales_raw["item_cnt_day"]

In [16]:
retail_sales_modif = retail_sales_raw.copy()  # .copy() so later changes do not also alter retail_sales_raw
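
Note that assigning a DataFrame to a new name does not duplicate it; only `.copy()` creates an independent object. A short sketch with illustrative names (`original`, `alias`, `independent`):

```python
import pandas as pd

original = pd.DataFrame({"item_price": [1199.0, 479.0]})

alias = original               # same object under a second name
independent = original.copy()  # separate object with its own data

alias["flag"] = 1              # the new column also appears in `original`

print("flag" in original.columns)     # True
print("flag" in independent.columns)  # False
```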

In [17]:
retail_sales_modif

Unnamed: 0,date,shop_id,item_id,item_price,item_cnt_day,price_per_sell
0,2015-01-04 00:00:00,29,1469,1199.0,1.0,1199.0
1,2015-01-04 00:00:00,28,21364,479.0,1.0,479.0
2,2015-01-04 00:00:00,28,21365,999.0,2.0,1998.0
3,2015-01-04 00:00:00,28,22104,249.0,2.0,498.0
4,2015-01-04 00:00:00,28,22091,179.0,1.0,179.0
...,...,...,...,...,...,...
4540,2015-01-04 00:00:00,15,4240,1299.0,1.0,1299.0
4541,2015-01-04 00:00:00,14,21922,99.0,1.0,99.0
4542,2015-01-04 00:00:00,15,1969,3999.0,1.0,3999.0
4543,2015-01-04 00:00:00,14,22091,179.0,1.0,179.0


In [18]:
# Create a simple histogram of the shop ids.
# This histogram says nothing because it uses the wrong axis:
# (retail_sales_raw['shop_id'].value_counts()).hist(axis=1)

In [19]:
# I should handle null values automatically if there are any:
# if something is null, remove it or print a message
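
A minimal sketch of such an automatic check, on a made-up frame with one missing price (`df` and `null_counts` are illustrative names, and dropping the affected rows is just one possible policy):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "shop_id": [29, 28, 28],
        "item_price": [1199.0, np.nan, 999.0],
        "item_cnt_day": [1.0, 1.0, 2.0],
    }
)

# Count the missing values in each column.
null_counts = df.isnull().sum()

if null_counts.any():
    # Report which columns are affected, then drop the incomplete rows.
    print("Null values found per column:")
    print(null_counts[null_counts > 0])
    df = df.dropna().reset_index(drop=True)
```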

In [20]:
# save the modified table (raw data plus the price_per_sell column) to a CSV separated by ","
retail_sales_modif.to_csv(path_or_buf=("../data/retail_sales-modif.csv"),sep=",")

In [21]:
# aggregate by shop id and sum item_cnt_day, to see which shops sell the most items
shops_df_item_cnt=retail_sales_raw.groupby('shop_id').agg({'item_cnt_day': 'sum'})
shops_df_item_cnt.sort_values(by=['item_cnt_day'], ascending=False)

Unnamed: 0_level_0,item_cnt_day
shop_id,Unnamed: 1_level_1
31,402.0
57,324.0
25,312.0
42,249.0
28,246.0
12,216.0
54,195.0
55,180.0
27,180.0
21,180.0


In [22]:
# aggregate by shop id and sum item_price, to see which shops sell the most expensive items
shops_df_item_price=retail_sales_raw.groupby('shop_id').agg({'item_price': 'sum'})
shops_df_item_price.sort_values(by=['item_price'], ascending=False).head(3)

Unnamed: 0_level_0,item_price
shop_id,Unnamed: 1_level_1
42,327864.0
25,281796.0
31,268098.0


In [28]:
# I added this aggregate to know the amount of money made by each shop:
# group by shop and sum the new price_per_sell column to find the shops that make the most
shops_df_total=retail_sales_raw.groupby('shop_id').agg({'price_per_sell': 'sum'})
shops_df_total.sort_values(by=['price_per_sell'], ascending=False).head(3)

Unnamed: 0_level_0,price_per_sell
shop_id,Unnamed: 1_level_1
42,330111.0
31,304692.0
12,295173.0
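
The separate per-shop aggregates could also be built in one pass with named aggregation. A minimal sketch on four rows taken from the table shown earlier (the names `sales`, `shops_summary`, `n_rows`, `items_sold`, `price_sum` and `revenue` are illustrative assumptions):

```python
import pandas as pd

sales = pd.DataFrame(
    {
        "shop_id": [29, 28, 28, 28],
        "item_price": [1199.0, 479.0, 999.0, 249.0],
        "item_cnt_day": [1.0, 1.0, 2.0, 2.0],
    }
)
sales["price_per_sell"] = sales["item_price"] * sales["item_cnt_day"]

# One output column per (source column, aggregation) pair.
shops_summary = (
    sales.groupby("shop_id")
    .agg(
        n_rows=("item_cnt_day", "size"),
        items_sold=("item_cnt_day", "sum"),
        price_sum=("item_price", "sum"),
        revenue=("price_per_sell", "sum"),
    )
    .sort_values("revenue", ascending=False)
)

print(shops_summary)
```

Assigning the sorted result back to a variable, as done here, keeps the ordering for any later export.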


In [29]:
# `shops_df` was never defined (it raised a NameError); the totals per shop are in `shops_df_total`
shops_df_total.head(3)

In [30]:
# save the table of shops and their sold item counts to a CSV separated by ","
# (the sort above was not assigned, so the rows are saved in shop_id order)
shops_df_item_cnt.to_csv(path_or_buf=("../data/sales_item_count.csv"),sep=",")

In [31]:
# save the per-shop sums of item_price to a CSV separated by ","
shops_df_item_price.to_csv(path_or_buf=("../data/sales_item_price.csv"),sep=",")

In [32]:
# save the per-shop totals of price_per_sell to a CSV separated by ","
shops_df_total.to_csv(path_or_buf=("../data/sales_total.csv"),sep=",")