# Mandatory Challenge
## Context
You work in the data analysis team of a very important company. On Monday, the company shares some good news with you: you just got hired by a major retail company! So, let's get prepared for a huge amount of work!

Then you get to work with your team and define the following tasks to perform:   
1. You need to start your analysis using data from the past.  
2. You need to define a process that takes your daily data as an input and integrates it.  

You are in charge of the second part, so you are provided with a sample file that you will have to read daily. To complete your task, you need the following aggregates:
* One aggregate per store that adds up the rest of the values.
* One aggregate per item that adds up the rest of the values.

You can import the dataset `retail_sales` from Ironhack's database. 

## Your task
Therefore, your process will consist of the following steps:
1. Read the sample file that a daily process will save in your folder. 
2. Clean up the data.
3. Create the aggregates.
4. Write three tables in your local database: 
    - A table for the cleaned data.
    - A table for the aggregate per store.
    - A table for the aggregate per item.

## Instructions
* Read the CSV you can find in Ironhack's database.
* Clean the data and create the aggregates as you see fit.
* Create the tables in your local database.
* Populate them with your process.

In [5]:
import pandas as pd
from sqlalchemy import create_engine

# Creating a function that opens a connection to the database.

def connect_to_database(connection_string):
    engine = create_engine(connection_string)
    return engine.connect()

driver   = 'mysql+pymysql:'
user     = 'data-students'
password = 'iR0nH@cK-D4T4B4S3'
ip       = '34.65.10.136'
database = 'retail_sales'

connection_string = f'{driver}//{user}:{password}@{ip}/{database}'
conn = connect_to_database(connection_string)

retail_sales = pd.read_sql('SELECT * FROM raw_sales', conn)
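
One detail worth guarding against: the password above contains an `@`, which is also the character that separates credentials from the host in a database URL. Depending on the SQLAlchemy version, an unescaped `@` may be parsed incorrectly. Below is a minimal sketch, using only the standard library, of percent-encoding the credentials before building the string. The `build_connection_string` helper and the example password are hypothetical and not part of the original notebook.

```python
from urllib.parse import quote_plus

def build_connection_string(driver, user, password, host, database):
    """Build a database URL with percent-encoded credentials."""
    return f"{driver}://{quote_plus(user)}:{quote_plus(password)}@{host}/{database}"

# Illustrative values only; the password is made up for this example.
example = build_connection_string(
    "mysql+pymysql", "data-students", "p@ss/word", "127.0.0.1", "retail_sales"
)
print(example)
```

After encoding, the only `@` left in the URL is the one that separates the credentials from the host.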

In [7]:
retail_sales

Unnamed: 0,date,shop_id,item_id,item_price,item_cnt_day
0,2015-01-04,29,1469,1199.0,1.0
1,2015-01-04,28,21364,479.0,1.0
2,2015-01-04,28,21365,999.0,2.0
3,2015-01-04,28,22104,249.0,2.0
4,2015-01-04,28,22091,179.0,1.0
...,...,...,...,...,...
4540,2015-01-04,15,4240,1299.0,1.0
4541,2015-01-04,14,21922,99.0,1.0
4542,2015-01-04,15,1969,3999.0,1.0
4543,2015-01-04,14,22091,179.0,1.0


In [8]:
# Analysing the data

retail_sales.describe()

# ANOMALY 1: There are negative values in item_cnt_day (min = -1).
# ANOMALY 2: There is a huge difference between the min and max price.
# 75% of the prices are at or below 1192, so it seems odd that the max price is almost 28k.

Unnamed: 0,shop_id,item_id,item_price,item_cnt_day
count,4545.0,4545.0,4545.0,4545.0
mean,34.021122,11140.459406,1031.686121,1.10363
std,16.565517,6558.649572,2073.91999,0.536967
min,2.0,30.0,3.0,-1.0
25%,22.0,4977.0,249.0,1.0
50%,31.0,11247.0,479.0,1.0
75%,50.0,16671.0,1192.0,1.0
max,59.0,22162.0,27990.0,10.0
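
To put a number on Anomaly 2, one common rule of thumb is Tukey's fences, which flag values above Q3 + 1.5 × IQR. This is an assumption made here for illustration, not a method the notebook itself applies. Using the `item_price` quartiles from the `describe()` output above:

```python
# Quartiles and max of item_price, copied from the describe() output above.
q1, q3 = 249.0, 1192.0
max_price = 27990.0

iqr = q3 - q1                  # interquartile range
upper_fence = q3 + 1.5 * iqr   # Tukey's upper fence

print(iqr, upper_fence, max_price > upper_fence)
```

Prices above the fence are not necessarily wrong, since a retailer may simply sell a few expensive items. The fence is a prompt to inspect those rows rather than a rule for dropping them.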


In [60]:
# ANOMALY 1: There are negative values in item_cnt_day (min = -1)

# How many?
indx_neg_values = retail_sales.item_cnt_day[(retail_sales.item_cnt_day < 0)]
indx_neg_values.count()
### There are 30 negative values in the column. That is not many (under 1% of the 4545 rows).

30

In [48]:
# What to do with them?
### I would rather drop them than fill them in, as they look like typos.

In [50]:
# How to clean them?
### Creating a function that drops rows with a negative item_cnt_day

def drop_negative(retail_sales_new_data):
    # Keep only the rows of the given DataFrame whose item_cnt_day is zero or positive.
    return retail_sales_new_data.loc[retail_sales_new_data['item_cnt_day'] >= 0]

retail_sales_cleaned = drop_negative(retail_sales)
retail_sales_cleaned

Unnamed: 0,date,shop_id,item_id,item_price,item_cnt_day
0,2015-01-04,29,1469,1199.0,1.0
1,2015-01-04,28,21364,479.0,1.0
2,2015-01-04,28,21365,999.0,2.0
3,2015-01-04,28,22104,249.0,2.0
4,2015-01-04,28,22091,179.0,1.0
...,...,...,...,...,...
4540,2015-01-04,15,4240,1299.0,1.0
4541,2015-01-04,14,21922,99.0,1.0
4542,2015-01-04,15,1969,3999.0,1.0
4543,2015-01-04,14,22091,179.0,1.0


In [52]:
# ANOMALY 2: There is a huge difference between the min and max price.
# 75% of the prices are at or below 1192, so it seems odd that the max price is almost 28k.

# How many values are there for each price point?
retail_sales_cleaned.item_price.value_counts()

399.00     345
299.00     294
99.00      243
199.00     189
349.00     183
          ... 
70.00        3
248.00       3
721.55       3
989.00       3
2598.00      3
Name: item_price, Length: 221, dtype: int64

In [72]:
retail_sales_cleaned.item_price.min()

3.0

In [73]:
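# How many values are there for each price range?
retail_sales_cleaned.item_price.value_counts(bins=5).sort_index()

# ANOMALY 2.1: The first bin starts below zero, which hints at negative prices.
# However, item_price.min() is 3.0, so the negative edge comes from the padding pandas adds to the lowest bin, not from real data.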

(-24.988, 5600.4]     4470
(5600.4, 11197.8]       18
(11197.8, 16795.2]       6
(16795.2, 22392.6]       3
(22392.6, 27990.0]      18
Name: item_price, dtype: int64
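
The negative left edge of the first bin does not by itself prove that negative prices exist. When `value_counts` receives an integer `bins`, pandas builds equal-width bins and extends the lowest edge slightly below the minimum so that the smallest value falls inside the first interval. A small sketch with illustrative prices (not the real dataset) shows the effect:

```python
import pandas as pd

# Illustrative values only; they share the min and max seen in the dataset.
prices = pd.Series([3.0, 399.0, 1192.0, 27990.0])
binned = prices.value_counts(bins=5).sort_index()

# The index holds Interval objects; inspect the lowest one.
first_bin = binned.index[0]
print("left edge of first bin:", first_bin.left)
print("smallest price:", prices.min())
print("non-positive prices:", int((prices <= 0).sum()))
```

The left edge is below zero even though every price is positive, so a direct filter on `item_price <= 0` is the reliable check.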

In [61]:
retail_sales_cleaned

Unnamed: 0,date,shop_id,item_id,item_price,item_cnt_day
0,2015-01-04,29,1469,1199.0,1.0
1,2015-01-04,28,21364,479.0,1.0
2,2015-01-04,28,21365,999.0,2.0
3,2015-01-04,28,22104,249.0,2.0
4,2015-01-04,28,22091,179.0,1.0
...,...,...,...,...,...
4540,2015-01-04,15,4240,1299.0,1.0
4541,2015-01-04,14,21922,99.0,1.0
4542,2015-01-04,15,1969,3999.0,1.0
4543,2015-01-04,14,22091,179.0,1.0


In [68]:
# ANOMALY 2.1: The binning above suggested negative prices.
# How many are there? (including price = 0)
indx_neg_prices = retail_sales_cleaned.loc[(retail_sales_cleaned.item_price <= 0)]
indx_neg_prices

Unnamed: 0,date,shop_id,item_id,item_price,item_cnt_day


In [None]:
# What to do with them?
## I would delete such entries: a negative item price makes no sense and must be a typo.
## The check above returned no rows, though, so there is nothing to delete here.
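
Both cleaning rules discussed so far (no negative counts, no non-positive prices) can live in one function that works on whatever frame it is given, which suits a process that runs daily. This is a minimal sketch; `clean_sales` and the sample rows are hypothetical.

```python
import pandas as pd

def clean_sales(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with a negative item count or a non-positive price."""
    valid = (df["item_cnt_day"] >= 0) & (df["item_price"] > 0)
    return df.loc[valid].reset_index(drop=True)

# Made-up rows: one negative count and one negative price.
sample = pd.DataFrame({
    "shop_id": [29, 28, 28, 15],
    "item_id": [1469, 21364, 21365, 4240],
    "item_price": [1199.0, 479.0, -5.0, 1299.0],
    "item_cnt_day": [1.0, -1.0, 2.0, 1.0],
})
cleaned = clean_sales(sample)
print(len(sample), "->", len(cleaned))
```

Because the function only reads its argument, it can be reused unchanged on each new daily file.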


In [25]:
# List the tables in the database, using the connection opened earlier.
df = pd.read_sql('SHOW TABLES', conn)

In [20]:
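# Creating aggregate per store (mean of the remaining numeric columns).
# The brief asks for values to be added up; swap "mean" for "sum" to follow it literally.
# retail_sales holds the rows loaded from the raw_sales table; use retail_sales_cleaned to aggregate only the cleaned rows.

sales_by_store = retail_sales.groupby("shop_id")

avg_by_store = sales_by_store[["item_id", "item_price", "item_cnt_day"]].agg(["mean"])
avg_by_store.head()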

Unnamed: 0_level_0,item_id,item_price,item_cnt_day
Unnamed: 0_level_1,mean,mean,mean
shop_id,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
2,12891.72,1320.94,1.08
3,10174.090909,2043.727273,1.0
4,12785.230769,752.846154,1.0
5,13797.066667,736.4,1.0
6,10054.714286,923.428571,1.190476


In [21]:
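# Creating aggregate per item that adds up the rest of the values.
# retail_sales holds the rows loaded from the raw_sales table; use retail_sales_cleaned to aggregate only the cleaned rows.

sum_by_item = retail_sales.groupby("item_id")[["shop_id", "item_price", "item_cnt_day"]].agg(["sum"])
sum_by_item.head()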

Unnamed: 0_level_0,shop_id,item_price,item_cnt_day
Unnamed: 0_level_1,sum,sum,sum
item_id,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
30,84,507.0,3.0
31,18,1089.0,3.0
32,93,447.0,3.0
42,162,897.0,3.0
59,171,747.0,3.0
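
The outputs above carry a two-level column header, which is awkward to store as SQL column names. One possible alternative, sketched here on made-up rows, is named aggregation, which sums the value columns per key and returns flat column names. The helper `sum_by` and the output column names are hypothetical.

```python
import pandas as pd

# Made-up rows with the same columns as the dataset.
sales = pd.DataFrame({
    "shop_id": [28, 28, 29, 29],
    "item_id": [21364, 21365, 21364, 1469],
    "item_price": [479.0, 999.0, 479.0, 1199.0],
    "item_cnt_day": [1.0, 2.0, 1.0, 1.0],
})

def sum_by(df: pd.DataFrame, key: str) -> pd.DataFrame:
    """Sum price and count per `key`, keeping `key` as a regular column."""
    return df.groupby(key, as_index=False).agg(
        total_item_price=("item_price", "sum"),
        total_item_cnt=("item_cnt_day", "sum"),
    )

by_store = sum_by(sales, "shop_id")
by_item = sum_by(sales, "item_id")
print(by_store)
print(by_item)
```

With `as_index=False` the grouping key stays a normal column, so it is not lost if the frame is later written with `index=False`.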


In [None]:
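# Creating the tables in your local database ('[path to local database]' is a placeholder for its connection string).
local_conn = connect_to_database('[path to local database]')
# Keep the index when writing: shop_id is the group key and would be lost with index=False.
avg_by_store.to_sql('avg_by_store', local_conn, if_exists='replace')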


In [24]:
# Populating them with your process.

NameError: name 'cwd' is not defined
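
The populate step is left without working code above. Here is a minimal end-to-end sketch of what the daily process could look like, under stated assumptions: an in-memory SQLite database stands in for the local database, an in-memory string stands in for the daily CSV file, and the function name `run_daily_process` and the three table names are hypothetical.

```python
import io
import sqlite3
import pandas as pd

def run_daily_process(csv_source, conn):
    """Read one daily file, clean it, aggregate it, and write three tables."""
    raw = pd.read_csv(csv_source)
    cleaned = raw[(raw["item_cnt_day"] >= 0) & (raw["item_price"] > 0)]
    value_cols = ["item_price", "item_cnt_day"]
    by_store = cleaned.groupby("shop_id", as_index=False)[value_cols].sum()
    by_item = cleaned.groupby("item_id", as_index=False)[value_cols].sum()
    # if_exists="replace" overwrites each table on every run; "append" would
    # accumulate rows across days instead.
    cleaned.to_sql("sales_cleaned", conn, if_exists="replace", index=False)
    by_store.to_sql("sales_by_store", conn, if_exists="replace", index=False)
    by_item.to_sql("sales_by_item", conn, if_exists="replace", index=False)
    return len(raw), len(cleaned)

# Stand-in for the daily CSV file; a real run would pass a file path.
daily_csv = io.StringIO(
    "date,shop_id,item_id,item_price,item_cnt_day\n"
    "2015-01-04,29,1469,1199.0,1.0\n"
    "2015-01-04,28,21364,479.0,1.0\n"
    "2015-01-04,28,21365,999.0,-1.0\n"
)
conn = sqlite3.connect(":memory:")  # stand-in for the local database
rows_read, rows_kept = run_daily_process(daily_csv, conn)
stores = pd.read_sql("SELECT * FROM sales_by_store ORDER BY shop_id", conn)
print(rows_read, rows_kept)
print(stores)
```

Whether to replace or append depends on how the tables are meant to be used. Appending the cleaned rows daily keeps history, but the aggregates would then need to be recomputed over the full table rather than summed per file.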