<a name="top"></a>Contents
===
- [Introduction](#intro)

- [Libraries](#libraries)

- [Functions](#functions)

- [Datetime Dataframe](#datetime)

- [Pageviews](#pageviews)
    - [Open Pageviews dataset](#open_pageviews)
    - [Customize and save Pageviews dataset](#customize_pageviews)
    
- [BW](#bw)
    - [Open BW dataset](#open_bw)
    - [Customize and save BW dataset](#customize_bw)

    
- [Adwords](#adwords)
    - [Open Adwords dataset](#open_adwords)
    - [Customize Adwords dataset](#customize_adwords)
    - [Merge Adwords dataset](#merge_ads)
    - [Save Adwords dataset](#save_ads)    

    
- [Final dataset](#final)
    - [Joining dataset](#join)
    - [Rolling dataset](#rolling)
    - [Concat and save Final Dataset](#save_final)
   

-----------------------------------------------------------------------------------------
<a name='intro'></a>
# Introduction

The main purpose of this script is to obtain all the relevant information to develop a machine learning model to optimize the profitability of shopping campaigns.

During this notebook, a process is developed to obtain such quality information related to the previously selected references using several data sources.

------------------------------------------------------------------------------------------
<a name='libraries'></a>
# Libraries
First of all, it is necessary to import the required libraries to develop all the steps correctly of this notebook.

In [1]:
#import libraries
import os
import glob
import pandas as pd
import numpy as np
from datetime import datetime


#import filter warnings
import warnings
warnings.filterwarnings('ignore')


#display a maximum of 500 columns and rows
pd.set_option('display.max_rows',500)
pd.set_option('display.max_columns',500)

#set directory of work
path = '/home/miguel/my_project_python/my_project_env/tfm/shopping'
os.chdir(path)

-----------------------------------------------------------------------------------------------
<a name='functions'></a>
# Functions
In this chapter is run the functions used in this notebook.

In [2]:
#OPEN_BW: load bw dataset
def open_bw(file, path_file):
    
    #open the file xlsx    
    df_bw = pd.read_csv(path_file + file, delimiter = ';', error_bad_lines=False)
    
    #rename the file columns correctly
    df_bw = df_bw.rename(index=str, columns={'Referencia':'Reference', 'T Día natural':'Date', 
                                             'Valor neto pedidos':'Net_Incomes', 
                                             'Cantidad en unidades (pedidos)':'Units_sold'})
        
    #replace
    df_bw['Net_Incomes'] = df_bw['Net_Incomes'].str.replace('.','').str.replace(',','.').astype(float)
    
    #date columns to specific datetime format
    df_bw['Date'] = pd.to_datetime(df_bw['Date'], format = '%d.%m.%Y')
    
    return df_bw



#LIST_FUNCTION: create a list from a column
def list_function(df):
    
    #series from int to string
    lista = df.apply(str)
    
    #string to list
    c = lista.values.tolist()
    
    return c



#LOOP_FUNCTION: create a loop for saving data
def loop_function(df, c, path_save, path_data1 = '', path_data2 = '', path_data3 = '', file_format = '.csv'):

    #conditional to select the right function
    if df == pageview_missing_values:
        #loop
        for i in range(len(c)):
            df_loop = df(path_data1, c[i])
            file = c[i]
            df_loop.to_csv(path_save + file + file_format)
        return print('Applied correctly function: pageview_missing_values')
    
    elif df == bw_missing_values:
        #loop
        for i in range(len(c)):
            df_loop = bw_missing_values(path_data1, path_data2, c[i])
            file = c[i]
            df_loop.to_csv(path_save + file + file_format)
        return print('Applied correctly function: bw_missing_values')
    
    elif df == ads_missing_values:
        #loop
        for i in range(len(c)):
            df_loop = ads_missing_values(path_data1, path_data2, c[i])
            file = c[i]
            df_loop.to_csv(path_save + file + file_format)
        return print('Applied correctly function: ads_missing_values')

    elif df == join_dataset:
        #loop
        for i in range(len(c)):
            df_loop = join_dataset(c[i], path_data1, path_data2, path_data3)
            file = c[i]
            df_loop.to_csv(path_save + file + file_format)
        return print('Applied correctly function: join_dataset')
    
    elif df == rolling_dataset:
        #loop
        for i in range(len(c)):
            df_loop = rolling_dataset(path_data1, c[i])
            file = c[i]  
            df_loop.to_csv(path_save + file + file_format)
        return print('Applied correctly function: rolling_dataset')
        
    else:       
        #loop
        for i in range(len(c)):
            df_loop = DataFrameDict[c[i]]
            file = c[i]
            df_loop.to_csv(path_save + file + file_format)
        return print('Function applied correctly')


    
#PAGE_MISSING_VALUES: customize the dataset
def pageview_missing_values(path, file):
    
    #read csv
    df_pageview = pd.read_csv(path + file + '_url.csv')
    
    #rename columns
    df_pageview = df_pageview.rename(index=str, columns={'Day Index':'Dates', 'Page Views': 'Page_Views'})
    
    #select columns of interest
    column_interest_pageview = ['Page', 'Dates', 'Page_Views']
    df_pageview = df_pageview[column_interest_pageview]
    
    #parse dates    
    df_pageview['Dates'] = pd.to_datetime(df_pageview['Dates'], format = '%d/%m/%Y')
    
    #merge dataframes    
    merge_pageview = df_pageview.merge(df_datetime,
        how='right',
        left_on=['Dates'],
        right_on=['Dates'])
    
    #fill values    
    merge_pageview['Page'].fillna(method='ffill', inplace = True)
    merge_pageview['Page_Views'].fillna(0, inplace = True)
    
    #merge url
    merge_url = merge_pageview.merge(df_reference_url,
        how='inner',
        left_on=['Page'],
        right_on=['URL'])
    
    #select columns
    column_interest_reference = ['Reference', 'Dates', 'Page_Views']
    merge_url = merge_url[column_interest_reference].sort_values(by='Dates')
        
    return merge_url



#BW_MISSING_VALUES: customize the dataset
def bw_missing_values(path1, path2, file):
    
    #set directory and open datetime_df file
    #os.chdir(path1)
    df_datetime = pd.read_csv(path1 + 'datetime_df.csv')
    
    #selecto column interest and parse data to datetime
    column_interest_datetime = ['Dates']
    df_datetime = df_datetime[column_interest_datetime]
    df_datetime['Dates'] = pd.to_datetime(df_datetime['Dates'], format = '%Y-%m-%d')

    #open the csv file for each reference
    df_bw_ref = pd.read_csv(path2 + file + '.csv') #c[i]
        
    #data wranggling of this data        
    df_bw_ref['Date'] = pd.to_datetime(df_bw_ref['Date'], format = '%Y-%m-%d')
    df_bw_ref = df_bw_ref.rename(index=str, columns={'Net Incomes': 'Net_Incomes', })
        
    #select columns 
    column_interest_bw = ['Reference', 'Name', 'Date', 'Net_Incomes', 'Units_sold']
    df_bw_ref = df_bw_ref[column_interest_bw]
        
    #merge with df_datetime
    merge_bw = df_bw_ref.merge(df_datetime,
        how='right',
        left_on=['Date'],
        right_on=['Dates'])
        
    #fill NaN values
    merge_bw['Reference'].fillna(method='ffill', inplace = True)
    merge_bw['Name'].fillna(method='ffill', inplace = True)
    merge_bw.fillna(0, inplace = True)
    
    #sort values by Dates
    merge_bw = merge_bw.sort_values(by='Dates')
    
    #select required columns
    column_interest_merge = ['Reference', 'Name', 'Dates', 'Net_Incomes', 'Units_sold']
    merge_bw = merge_bw[column_interest_merge]   
 
    return merge_bw



#ADS_MISSING_VALUES: customize the dataset
def ads_missing_values(path1, path2, file):
    
    #set directory and open datetime_df file
    #os.chdir(path1)
    df_datetime = pd.read_csv(path1 + 'datetime_df.csv')
    
    #selecto column interest and parse data to datetime
    column_interest_datetime = ['Dates']
    df_datetime = df_datetime[column_interest_datetime]
    df_datetime['Dates'] = pd.to_datetime(df_datetime['Dates'], format = '%Y-%m-%d')
    
    #open the csv file for each reference
    df_ads_ref = pd.read_csv(path2 + file + '.csv') #quizas debo poner file en vez de c[i]
        
    #data wranggling of this data        
    df_ads_ref['Date'] = pd.to_datetime(df_ads_ref['Date'], format = '%Y-%m-%d')
     
    #merge with df_datetime
    merge_bw = df_ads_ref.merge(df_datetime,
        how='right',
        left_on=['Date'],
        right_on=['Dates'])
        
    #fill NaN values
    merge_bw['Reference'].fillna(method='ffill', inplace = True)
    merge_bw['CatN1'].fillna(method='ffill', inplace = True) #error está aqui
    merge_bw['CatN2'].fillna(method='ffill', inplace = True) #error está aqui
    merge_bw['Cat_Price'].fillna(method='ffill', inplace = True) #error está aqui
    merge_bw.fillna(0, inplace = True)
    
    #sort values by Dates
    merge_bw = merge_bw.sort_values(by='Dates')
    
    #select required columns
    merge_bw = merge_bw.drop(['Date', 'Unnamed: 0'], axis=1)
    
    #drop duplicates in 'Dates'
    merge_bw['Dates'] = merge_bw['Dates'].drop_duplicates(keep='first')

        
    return merge_bw



#JOIN_DATASET: customize the dataset
def join_dataset(file, path_ads, path_bw, path_page, file_format = '.csv'):
    
    #set directory to open the adwords dataset
    df_ads = pd.read_csv(path_ads + file + file_format, sep=',', error_bad_lines=True)
        
    
    #set directory to open the bw dataset
    #os.chdir(path_bw)
    df_bw = pd.read_csv(path_bw + file + file_format, sep=',', error_bad_lines=True)
        
    
    #set directory to open the page dataset
    df_page = pd.read_csv(path_page + file + file_format, sep=',', error_bad_lines=True)
        
    #merge data
    df_final = df_ads.merge(df_bw,on='Dates').merge(df_page,on='Dates')
    
    return df_final



#ROLLING_DATASET: apply rolling
def rolling_dataset(path, file, file_format = '.csv'):
    
    #open the csv file for each reference
    df_rolling = pd.read_csv(path + file + file_format, sep=',', error_bad_lines=True)

    #select columns
    columns_interest_df = ['Reference', 'CatN1', 'CatN2', 'Cat_Price', 'Dates', 'CPC_medio', 'Impressions', 
                           'Clics', 'Page_Views', 'Cost', 'Conversions', 'All_Conversions', 'Ads_Income', 
                           'Ads_Income_All', 'Net_Incomes', 'Units_sold', 'ROAS_Ads']
    df_rolling = df_rolling[columns_interest_df]
    
    #create CTR column
    df_rolling['CTR'] = df_rolling['Clics'] / df_rolling['Impressions']

    #rolling
    #select columns to rolling
    columns = ['CPC_medio', 'Impressions', 'Clics', 'CTR', 'Page_Views', 'Cost', 'Conversions', 
                       'All_Conversions', 'Ads_Income', 'Ads_Income_All', 'Net_Incomes', 'Units_sold']

    #for loop to apply columns
    for i in range(len(columns)):
        column = columns[i]

    #columns created with rolling
        column1 = columns[i]+'_1w'
        column2 = columns[i]+'_2w'
        column3 = columns[i]+'_3w'
        column4 = columns[i]+'_4w'
            
    #apply rolling in each column
        df_rolling[column1] = round(df_rolling[column].transform(lambda x: x.rolling(7, 1).mean()),2)
        df_rolling[column2] = round(df_rolling[column].transform(lambda x: x.rolling(14, 1).mean()),2)
        df_rolling[column3] = round(df_rolling[column].transform(lambda x: x.rolling(21, 1).mean()),2)
        df_rolling[column4] = round(df_rolling[column].transform(lambda x: x.rolling(28, 1).mean()),2)     
        
    #filter date period
        df_rolling = df_rolling[df_rolling['Dates'] > '2018-09-30']
        

    return df_rolling

---------------------------------------------------------------------------------------------------
<a name='datetime'></a>  
# Datetime Dataframe

An inconvenient of the data selected for this study is "missing date values". 

**¿What does "missing date values" mean?**
<br>It means that there is no information saved in dataset when a reference has not had a transaction for a single date of our period of time selected. 


**¿Is it possible to develop the project without completing these missing values?**
<br>No, because it is necessary to merge two datasets by this key dimension.

**¿How to solve this issue?**
<br>It is created a dataframe for a specific period of time and save it as csv with the purpose of using it to merge the data in a proper way.

In [3]:
#run the python script
%run -i './notebook/date_dataframe.py'

#check the file
#set path
path = '/home/miguel/my_project_python/my_project_env/tfm/shopping'
os.chdir(path)

#load the file
df_datetime = pd.read_csv('./data/raw/datetime/datetime_df.csv')    

#select columns
column_interest_datetime = ['Dates']
df_datetime = df_datetime[column_interest_datetime]

#parse datetime
df_datetime['Dates'] = pd.to_datetime(df_datetime['Dates'], format = '%Y-%m-%d')

In [4]:
df_datetime.head()

Unnamed: 0,Dates
0,2018-09-03
1,2018-09-04
2,2018-09-05
3,2018-09-06
4,2018-09-07


In [5]:
df.shape

(210, 1)

As we can see, it is correct the information created!

----------------------------------------------------------------------------------------------------
<a name='pageviews'></a>

# Pageviews
In this chapter, it is merged all datasets link to pageviews information in order to create for each top reference a csv file with the following data:
    - Reference
    - Date
    - Pageviews

<a name='open_pageviews'></a>
## Open Pageviews dataset


Firstly, it is read the file required:

In [6]:
#select directory
path_reference_url = './data/customize/top_50/'

#load files
df_reference_url = pd.read_csv(path_reference_url + 'references_url.csv', sep='\t') 

#select columns
column_interest_reference = ['Reference', 'URL']
df_reference_url = df_reference_url[column_interest_reference]

I check the data

In [7]:
df_reference_url.head(5)

Unnamed: 0,Reference,URL
0,29300,/lampara-mesita-de-noche-oriental-blanco-ceram...
1,30655,/banqueta-pie-de-cama-romantico-blanco-metal-d...
2,38697,/belen-navidad-moderno-multicolor-resina-decor...
3,48329,/mesita-auxiliar-industrial-blanco-metal-salon...
4,48624,/biombo-plegable-oriental-blanco-madera-salon-...


<a name='customize_pageviews'></a>
## Customize and save Pageviews dataset

After checking previous dataframe, it is run the function 'pageview_missing_values' to cover the following steps:

- Open a csv file as dataframe
- Clean data from that file (rename columns, parse to datetime, etc.)
- Merge that dataframe with other dataframes
- Fill missing values
- Return the final dataframe

Firstly, it is applied list_function to df_reference_url

In [8]:
c = list_function(df_reference_url['Reference'])

Then, run loop function to apply all steps and save data :

In [9]:
loop_function(pageview_missing_values, c, './data/customize/pageviews/', './data/raw/pageview/')

Applied correctly function: pageview_missing_values


It is check the information gathered

In [10]:
#load file example
pageview_example = pd.read_csv('./data/customize/pageviews/123839.csv', sep=',', error_bad_lines=True, index_col=0)

#head of dataframe
pageview_example.head()

Unnamed: 0,Reference,Dates,Page_Views
0,123839,2018-09-03,9.0
118,123839,2018-09-04,0.0
1,123839,2018-09-05,9.0
2,123839,2018-09-06,9.0
119,123839,2018-09-07,0.0


-----------------------------------------------------------------------------------------------------------------
<a name='bw'></a>
# BW
In this chapter, it is merged all datasets 'BW' and 'datatime' in order to create for each top reference a csv file with the following data:
    - Reference
    - Name
    - Dates
    - Net_Incomes
    - Units_sold	

<a name='open_bw'></a>
## Open BW dataset
First, it is opened the bw dataset:

In [11]:
#open the file
df_bw = open_bw('BW.csv', './data/raw/bw/')

#head the file
df_bw.head(5)

Unnamed: 0,Reference,Name,Date,Net_Incomes,Units_sold
0,4623,"ESPEJO PUERTA ""LCC"" BLANCO POLIETILENO",2017-09-11,18.93,1
1,4623,"ESPEJO PUERTA ""LCC"" BLANCO POLIETILENO",2017-09-21,18.93,1
2,4623,"ESPEJO PUERTA ""LCC"" BLANCO POLIETILENO",2017-10-08,16.08,1
3,4623,"ESPEJO PUERTA ""LCC"" BLANCO POLIETILENO",2017-10-16,16.08,1
4,4623,"ESPEJO PUERTA ""LCC"" BLANCO POLIETILENO",2017-10-21,16.08,1


In [12]:
df_bw.info()

<class 'pandas.core.frame.DataFrame'>
Index: 37614 entries, 0 to 37613
Data columns (total 5 columns):
Reference      37614 non-null int64
Name           37614 non-null object
Date           37614 non-null datetime64[ns]
Net_Incomes    37614 non-null float64
Units_sold     37614 non-null int64
dtypes: datetime64[ns](1), float64(1), int64(2), object(1)
memory usage: 1.7+ MB


In [13]:
df_bw.shape

(37614, 5)

<a name='customize_bw'></a>
## Customize and save BW dataset

In this section, it is going to be developed following steps to create a csv file which contains BW information for each item of this study:
    - Create a dataframe dictionary
    - Read top 50 file and select column of interest
    - Run a loop to apply the function required for each reference
    - Check the information saved

In [14]:
#apply string format
df_bw['Reference'] = df_bw['Reference'].apply(str)

#create unique list of names
referencess = df_bw['Reference'].unique().tolist()

#create a data frame dictionary to store your data frames
DataFrameDict = {elem : pd.DataFrame for elem in referencess}

#loop 
for key in DataFrameDict.keys():
    DataFrameDict[key] = df_bw[:][df_bw.Reference == key]

I use a reference id as example to check if it is working correctly

In [15]:
DataFrameDict['4623']

Unnamed: 0,Reference,Name,Date,Net_Incomes,Units_sold
0,4623,"ESPEJO PUERTA ""LCC"" BLANCO POLIETILENO",2017-09-11,18.93,1
1,4623,"ESPEJO PUERTA ""LCC"" BLANCO POLIETILENO",2017-09-21,18.93,1
2,4623,"ESPEJO PUERTA ""LCC"" BLANCO POLIETILENO",2017-10-08,16.08,1
3,4623,"ESPEJO PUERTA ""LCC"" BLANCO POLIETILENO",2017-10-16,16.08,1
4,4623,"ESPEJO PUERTA ""LCC"" BLANCO POLIETILENO",2017-10-21,16.08,1


After that, open top_50 dataset

In [16]:
#set directory
path = './data/customize/top_50/'

#open the file 'top_50'
df_top_50 = pd.read_csv(path + 'references_url.csv', sep='\t')

#select columns
column_interest_reference = ['Reference']
df_top_50 = df_top_50[column_interest_reference]

In [17]:
df_top_50.head()

Unnamed: 0,Reference
0,29300
1,30655
2,38697
3,48329
4,48624


In [18]:
df_top_50.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 1 columns):
Reference    50 non-null int64
dtypes: int64(1)
memory usage: 480.0 bytes


I create another list using last dataset (I apply string to "Reference" column as I have also made before)

In [19]:
c = list_function(df_top_50['Reference'])

Apply loop function for this case

In [20]:
loop_function(c, c, './data/customize/bw/')

Function applied correctly


Check the saved data

In [21]:
df_123839 = pd.read_csv('./data/customize/bw/123839.csv', sep=',', error_bad_lines=True, index_col=0)
df_123839.head()

Unnamed: 0,Reference,Name,Date,Net_Incomes,Units_sold
35260,123839,ESPEJO METAL ORO,2018-08-10,19.75,1
35261,123839,ESPEJO METAL ORO,2018-08-17,19.75,1
35262,123839,ESPEJO METAL ORO,2018-08-27,19.75,1
35263,123839,ESPEJO METAL ORO,2018-08-31,19.75,1
35264,123839,ESPEJO METAL ORO,2018-09-16,19.75,1


In [22]:
df_123839.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 46 entries, 35260 to 35305
Data columns (total 5 columns):
Reference      46 non-null int64
Name           46 non-null object
Date           46 non-null object
Net_Incomes    46 non-null float64
Units_sold     46 non-null int64
dtypes: float64(1), int64(2), object(2)
memory usage: 2.2+ KB


After that, it is run the function "bw_missing_values":

In [23]:
loop_function(bw_missing_values, c, './data/customize/merge_bw_datetime/', './data/raw/datetime/', './data/customize/bw/')

Applied correctly function: bw_missing_values


Check the data:

In [24]:
df_123839_x = pd.read_csv('./data/customize/merge_bw_datetime/123839.csv', sep=',', error_bad_lines=True, index_col=0)
df_123839_x.head()

Unnamed: 0,Reference,Name,Dates,Net_Incomes,Units_sold
42,123839.0,ESPEJO METAL ORO,2018-09-03,0.0,0.0
43,123839.0,ESPEJO METAL ORO,2018-09-04,0.0,0.0
44,123839.0,ESPEJO METAL ORO,2018-09-05,0.0,0.0
45,123839.0,ESPEJO METAL ORO,2018-09-06,0.0,0.0
46,123839.0,ESPEJO METAL ORO,2018-09-07,0.0,0.0


-----------------------------------------------------------------------------------------------------------
<a name='adwords'></a>

# Adwords

In this chapter, it is developed with adwords data the following steps:
    - Load the required datasets.
    - Concatenate these datasets into one main dataset
    - Customize data (rename, select columns, etc.)
    - Select specific references and their data from previous dataset.
    - Split the main dataframe into multiple dataframes (a dataframe per reference)
    - Merge multiples dataframes with "datatime_df"
    - Save multiples dataframes correctly as csv files
    

<a name='open_adwords'></a>

## Open Adwords dataset

It is opened adwords data:
- TFM_Datos_2017_18_2.csv: advertising data from 05-02-2018 to 30-09-2018
- TFM_Datos_2018_19.csv: advertising data from 01-10-2018 to 30-03-2019

In [25]:
#format extension
extension = 'csv'

#find filenames
all_filenames = [i for i in glob.glob('./data/raw/adwords/*.{}'.format(extension))]

#combine all files in the list
df_adwords = pd.concat([pd.read_csv(f, sep=";", error_bad_lines=True) for f in all_filenames])

Check the dataframe

In [26]:
df_adwords.head(5)

Unnamed: 0,ID de producto,Campaña,ID de la campaña,Tipo de producto (primer nivel),Tipo de producto (segundo nivel),Día,Etiqueta personalizada 1,CPC máximo predeterminado del grupo de anuncios,Moneda,Impresiones,Clics,CTR,CPC medio,Coste,Conversiones,Todas las conversiones,Valor de conv.,Valor de todas las conversiones,Valor conv./coste
0,123624,Shop_Cocina y comedor_N1_Y18_W37,1559515409,cocina y comedor,tazas de café y mugs,12/10/2018,10,15,EUR,1,0,"0,00 %",0,0,0,0,0,0,0
1,107553,Shop_Smart_Lámparas_Y18_W45,1623248621,lámparas e iluminación,lámparas de techo,31/01/2019,40,1,EUR,2,0,"0,00 %",0,0,0,0,0,0,0
2,87810,Shop_Navidad_Estrellas de navidad_Y18_W12,1323457572,decoración de navidad,estrellas de navidad,17/10/2018,20,2,EUR,1,0,"0,00 %",0,0,0,0,0,0,0
3,105114,Shop_Decoración_N1_Y18_W37,1559890739,decoración para tu casa,figuras decorativas,25/10/2018,30,15,EUR,4,0,"0,00 %",0,0,0,0,0,0,0
4,111017,Shop_Navidad_Portavelas navideños_Y18_W12,1323682318,decoración de navidad,portavelas navideños,21/10/2018,20,2,EUR,8,0,"0,00 %",0,0,0,0,0,0,0


In [27]:
df_adwords.shape

(1026615, 19)

In [28]:
df_adwords.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1026615 entries, 0 to 579075
Data columns (total 19 columns):
ID de producto                                     1026615 non-null int64
Campaña                                            1026615 non-null object
ID de la campaña                                   1026615 non-null int64
Tipo de producto (primer nivel)                    1026615 non-null object
Tipo de producto (segundo nivel)                   1026615 non-null object
Día                                                1026615 non-null object
Etiqueta personalizada 1                           1026615 non-null object
CPC máximo predeterminado del grupo de anuncios    1026615 non-null object
Moneda                                             1026615 non-null object
Impresiones                                        1026615 non-null int64
Clics                                              1026615 non-null int64
CTR                                                1026615 non-null

<a name='customize_adwords'></a>

## Customize Adwords dataset

In this chapter, it is customized the dataset in order to:
    - Select columns of interest
    - Rename columns
    - Change data to correct format

Firstly, check the current columns of the dataset

In [29]:
df_adwords.columns

Index(['ID de producto', 'Campaña', 'ID de la campaña',
       'Tipo de producto (primer nivel)', 'Tipo de producto (segundo nivel)',
       'Día', 'Etiqueta personalizada 1',
       'CPC máximo predeterminado del grupo de anuncios', 'Moneda',
       'Impresiones', 'Clics', 'CTR', 'CPC medio', 'Coste', 'Conversiones',
       'Todas las conversiones', 'Valor de conv.',
       'Valor de todas las conversiones', 'Valor conv./coste'],
      dtype='object')

Then, customize the dataset

In [30]:
#rename columns
df_adwords = df_adwords.rename(index=str, 
            columns={'ID de producto':'Reference', 'Tipo de producto (primer nivel)': 'CatN1',
                     'Tipo de producto (segundo nivel)':'CatN2', 'Día':'Date', 
                     'Etiqueta personalizada 1': 'Cat_Price', 
                     'CPC máximo predeterminado del grupo de anuncios': 'CPC_max', 'CPC medio': 'CPC_medio', 
                     'Impresiones':'Impressions', 'Coste':'Cost', 'Conversiones': 'Conversions',
                    'Todas las conversiones': 'All_Conversions', 'Valor de conv.': 'Ads_Income', 
                    'Valor de todas las conversiones': 'Ads_Income_All', 'Valor conv./coste': 'ROAS_Ads'})


#select columns of interest
column_interest_adwords = ['Reference', 'CatN1', 'CatN2', 'Date', 'Cat_Price', 'CPC_medio', 'Impressions', 
                           'Clics', 'Cost', 'Conversions', 'All_Conversions', 'Ads_Income', 'Ads_Income_All',
                          'ROAS_Ads']
df_adwords = df_adwords[column_interest_adwords]

#Change reference value from int to string in order to save references id as list later
df_adwords['Reference'] = df_adwords['Reference'].apply(str)

#Parse time to 'Date' column
df_adwords['Date'] = pd.to_datetime(df_adwords['Date'], format = '%d/%m/%Y')

#Change to float type
#float_columns = ['CPC_medio', 'Cost', 'Conversions', 'All_Conversions', 'Ads_Income', 'Ads_Income_All']
#df_adwords['CPC_max'] = df_adwords['CPC_max'].str.replace(',','.').astype(float)
df_adwords['CPC_medio'] = df_adwords['CPC_medio'].str.replace(',','.').astype(float)
df_adwords['Cost'] = df_adwords['Cost'].str.replace(',','.').astype(float)
df_adwords['Conversions'] = df_adwords['Conversions'].str.replace(',','.').astype(float)
df_adwords['All_Conversions'] = df_adwords['All_Conversions'].str.replace(',','.').astype(float)
df_adwords['Ads_Income'] = df_adwords['Ads_Income'].str.replace(',','.').astype(float)
df_adwords['Ads_Income_All'] = df_adwords['Ads_Income_All'].str.replace(',','.').astype(float)
df_adwords['ROAS_Ads'] = df_adwords['ROAS_Ads'].str.replace('.','').str.replace(',','.').astype(float)

Check data again

In [31]:
df_adwords.head(5)

Unnamed: 0,Reference,CatN1,CatN2,Date,Cat_Price,CPC_medio,Impressions,Clics,Cost,Conversions,All_Conversions,Ads_Income,Ads_Income_All,ROAS_Ads
0,123624,cocina y comedor,tazas de café y mugs,2018-10-12,10,0.0,1,0,0.0,0.0,0.0,0.0,0.0,0.0
1,107553,lámparas e iluminación,lámparas de techo,2019-01-31,40,0.0,2,0,0.0,0.0,0.0,0.0,0.0,0.0
2,87810,decoración de navidad,estrellas de navidad,2018-10-17,20,0.0,1,0,0.0,0.0,0.0,0.0,0.0,0.0
3,105114,decoración para tu casa,figuras decorativas,2018-10-25,30,0.0,4,0,0.0,0.0,0.0,0.0,0.0,0.0
4,111017,decoración de navidad,portavelas navideños,2018-10-21,20,0.0,8,0,0.0,0.0,0.0,0.0,0.0,0.0


In [32]:
df_adwords.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1026615 entries, 0 to 579075
Data columns (total 14 columns):
Reference          1026615 non-null object
CatN1              1026615 non-null object
CatN2              1026615 non-null object
Date               1026615 non-null datetime64[ns]
Cat_Price          1026615 non-null object
CPC_medio          1026615 non-null float64
Impressions        1026615 non-null int64
Clics              1026615 non-null int64
Cost               1026615 non-null float64
Conversions        1026615 non-null float64
All_Conversions    1026615 non-null float64
Ads_Income         1026615 non-null float64
Ads_Income_All     1026615 non-null float64
ROAS_Ads           1026615 non-null float64
dtypes: datetime64[ns](1), float64(7), int64(2), object(4)
memory usage: 117.5+ MB


In [33]:
df_adwords.shape

(1026615, 14)

<a name='merge_ads'></a>

## Merge Adwords dataset

It is merge adwords dataset with top_50 file by "Reference": this will reduce the size of the dataset and will also permit to split the information for each reference later

In [34]:
#df_top_50 to string
df_top_50['Reference'] = df_top_50['Reference'].apply(str)

#merge with adwords dataset
merge_adwords = df_adwords.merge(df_top_50,
    how='right',
    left_on=['Reference'],
    right_on=['Reference'])

In [35]:
merge_adwords.shape

(14627, 14)

In [36]:
reduction = round(df_adwords.shape[0]/merge_adwords.shape[0],0)

print("The dataset has been reduced by {} times".format(reduction))

The dataset has been reduced by 70.0 times


After simplifying the dataset, it is leaner splitting the dataset into multiple dataframes

In [37]:
#Create unique list of names
references = merge_adwords['Reference'].unique().tolist()

#Create a dataframe dictionary to store your data frames
DataFrameDict = {elem : pd.DataFrame for elem in references}

#Create a loop for to read all rows and save according to the key
for key in DataFrameDict.keys():
    DataFrameDict[key] = merge_adwords[:][merge_adwords.Reference == key]

Use a reference id as example to check its functionality

In [38]:
DataFrameDict['123839'].head()

Unnamed: 0,Reference,CatN1,CatN2,Date,Cat_Price,CPC_medio,Impressions,Clics,Cost,Conversions,All_Conversions,Ads_Income,Ads_Income_All,ROAS_Ads
11195,123839,decoración para tu casa,espejos de pared,2019-02-13,30,0.44,368,2,0.89,0.0,0.0,0.0,0.0,0.0
11196,123839,decoración para tu casa,espejos de pared,2019-03-30,30,0.28,421,4,1.13,0.0,0.0,0.0,0.0,0.0
11197,123839,decoración para tu casa,espejos de pared,2019-01-09,30,0.0,3,0,0.0,0.0,0.0,0.0,0.0,0.0
11198,123839,decoración para tu casa,espejos de pared,2019-01-21,30,0.34,541,8,2.75,0.01,0.01,0.72,0.72,0.26
11199,123839,decoración para tu casa,espejos de pared,2018-11-10,30,0.2,383,6,1.19,0.0,0.0,0.0,0.0,0.0


<a name='save_ads'></a>

## Save Adwords dataset

Apply loop_function to create the required dataframes per reference.

In [39]:
loop_function(c, c, './data/customize/adwords/')

Function applied correctly


Check the data created

In [40]:
df_test_ads = pd.read_csv('./data/customize/adwords/123839.csv', sep=',', error_bad_lines=True, index_col=0)

In [41]:
df_test_ads.head(5)

Unnamed: 0,Reference,CatN1,CatN2,Date,Cat_Price,CPC_medio,Impressions,Clics,Cost,Conversions,All_Conversions,Ads_Income,Ads_Income_All,ROAS_Ads
11195,123839,decoración para tu casa,espejos de pared,2019-02-13,30,0.44,368,2,0.89,0.0,0.0,0.0,0.0,0.0
11196,123839,decoración para tu casa,espejos de pared,2019-03-30,30,0.28,421,4,1.13,0.0,0.0,0.0,0.0,0.0
11197,123839,decoración para tu casa,espejos de pared,2019-01-09,30,0.0,3,0,0.0,0.0,0.0,0.0,0.0,0.0
11198,123839,decoración para tu casa,espejos de pared,2019-01-21,30,0.34,541,8,2.75,0.01,0.01,0.72,0.72,0.26
11199,123839,decoración para tu casa,espejos de pared,2018-11-10,30,0.2,383,6,1.19,0.0,0.0,0.0,0.0,0.0


In [42]:
df_test_ads.shape

(137, 14)

In [43]:
df_test_ads.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 137 entries, 11195 to 11331
Data columns (total 14 columns):
Reference          137 non-null int64
CatN1              137 non-null object
CatN2              137 non-null object
Date               137 non-null object
Cat_Price          137 non-null int64
CPC_medio          137 non-null float64
Impressions        137 non-null int64
Clics              137 non-null int64
Cost               137 non-null float64
Conversions        137 non-null float64
All_Conversions    137 non-null float64
Ads_Income         137 non-null float64
Ads_Income_All     137 non-null float64
ROAS_Ads           137 non-null float64
dtypes: float64(7), int64(4), object(3)
memory usage: 16.1+ KB


After that, apply the previous function using a for loop

In [44]:
loop_function(ads_missing_values, c, './data/customize/merge_ads_datetime/', './data/raw/datetime/', 
              './data/customize/adwords/')

Applied correctly function: ads_missing_values


Finally, I check the data obtained

In [45]:
df_123839 = pd.read_csv('./data/customize/merge_ads_datetime/123839.csv', sep=',', 
                        error_bad_lines=True, index_col = 0)

#head
df_123839.head(5)

Unnamed: 0,Reference,CatN1,CatN2,Cat_Price,CPC_medio,Impressions,Clics,Cost,Conversions,All_Conversions,Ads_Income,Ads_Income_All,ROAS_Ads,Dates
137,123839.0,decoración para tu casa,espejos de pared,30.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2018-09-03
138,123839.0,decoración para tu casa,espejos de pared,30.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2018-09-04
139,123839.0,decoración para tu casa,espejos de pared,30.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2018-09-05
140,123839.0,decoración para tu casa,espejos de pared,30.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2018-09-06
141,123839.0,decoración para tu casa,espejos de pared,30.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2018-09-07


In [46]:
df_123839.shape

(211, 14)

In [47]:
df_123839.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 211 entries, 137 to 126
Data columns (total 14 columns):
Reference          211 non-null float64
CatN1              211 non-null object
CatN2              211 non-null object
Cat_Price          211 non-null float64
CPC_medio          211 non-null float64
Impressions        211 non-null float64
Clics              211 non-null float64
Cost               211 non-null float64
Conversions        211 non-null float64
All_Conversions    211 non-null float64
Ads_Income         211 non-null float64
Ads_Income_All     211 non-null float64
ROAS_Ads           211 non-null float64
Dates              210 non-null object
dtypes: float64(11), object(3)
memory usage: 24.7+ KB


----------------------------------------------------------------------------------------------------------------
<a name='final'></a>

# Final Dataset

In this notebook, it is going to be merged all data created in previous chapters as one main dataset.

# PONER DIBUJO DE LO QUE VOY A HACER

<a name='join'></a>

## Joining Dataset

Join and save the dataset from BW, adwords and pageviews per each item

In [48]:
loop_function(join_dataset, c, './data/customize/merge_all/', './data/customize/merge_ads_datetime/', 
              './data/customize/merge_bw_datetime/', './data/customize/pageviews/')

Applied correctly function: join_dataset


Check the data

In [49]:
df_123839 = pd.read_csv('./data/customize/merge_all/123839.csv', sep=",", error_bad_lines=True)
df_123839.head(5)

Unnamed: 0.2,Unnamed: 0,Unnamed: 0_x,Reference_x,CatN1,CatN2,Cat_Price,CPC_medio,Impressions,Clics,Cost,Conversions,All_Conversions,Ads_Income,Ads_Income_All,ROAS_Ads,Dates,Unnamed: 0_y,Reference_y,Name,Net_Incomes,Units_sold,Unnamed: 0.1,Reference,Page_Views
0,0,137,123839.0,decoración para tu casa,espejos de pared,30.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2018-09-03,42,123839.0,ESPEJO METAL ORO,0.0,0.0,0,123839,9.0
1,1,138,123839.0,decoración para tu casa,espejos de pared,30.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2018-09-04,43,123839.0,ESPEJO METAL ORO,0.0,0.0,118,123839,0.0
2,2,139,123839.0,decoración para tu casa,espejos de pared,30.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2018-09-05,44,123839.0,ESPEJO METAL ORO,0.0,0.0,1,123839,9.0
3,3,140,123839.0,decoración para tu casa,espejos de pared,30.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2018-09-06,45,123839.0,ESPEJO METAL ORO,0.0,0.0,2,123839,9.0
4,4,141,123839.0,decoración para tu casa,espejos de pared,30.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2018-09-07,46,123839.0,ESPEJO METAL ORO,0.0,0.0,119,123839,0.0


In [50]:
df_123839.columns

Index(['Unnamed: 0', 'Unnamed: 0_x', 'Reference_x', 'CatN1', 'CatN2',
       'Cat_Price', 'CPC_medio', 'Impressions', 'Clics', 'Cost', 'Conversions',
       'All_Conversions', 'Ads_Income', 'Ads_Income_All', 'ROAS_Ads', 'Dates',
       'Unnamed: 0_y', 'Reference_y', 'Name', 'Net_Incomes', 'Units_sold',
       'Unnamed: 0.1', 'Reference', 'Page_Views'],
      dtype='object')

In [51]:
df_123839.shape

(210, 24)

In [52]:
df_123839.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 210 entries, 0 to 209
Data columns (total 24 columns):
Unnamed: 0         210 non-null int64
Unnamed: 0_x       210 non-null int64
Reference_x        210 non-null float64
CatN1              210 non-null object
CatN2              210 non-null object
Cat_Price          210 non-null float64
CPC_medio          210 non-null float64
Impressions        210 non-null float64
Clics              210 non-null float64
Cost               210 non-null float64
Conversions        210 non-null float64
All_Conversions    210 non-null float64
Ads_Income         210 non-null float64
Ads_Income_All     210 non-null float64
ROAS_Ads           210 non-null float64
Dates              210 non-null object
Unnamed: 0_y       210 non-null int64
Reference_y        210 non-null float64
Name               210 non-null object
Net_Incomes        210 non-null float64
Units_sold         210 non-null float64
Unnamed: 0.1       210 non-null int64
Reference          210 non-

<a name='rolling'></a>

## Rolling Dataset

Apply rolling function with the purpose of adding the requires stational information for specific columns.

In this case, it has been applied adding data for rolling per week:
    - 1w: 1 week
    - 2w: 2 weeks
    - 3w: 3 weeks
    - 4w: 4 weeks

In [53]:
loop_function(rolling_dataset, c, './data/customize/rolling_dataset/', './data/customize/merge_all/')

Applied correctly function: rolling_dataset


Check the data for one item example

In [54]:
df_123839 = pd.read_csv('./data/customize/rolling_dataset/123839.csv', sep=',', error_bad_lines=True, index_col=0)
df_123839.head(5)

Unnamed: 0,Reference,CatN1,CatN2,Cat_Price,Dates,CPC_medio,Impressions,Clics,Page_Views,Cost,Conversions,All_Conversions,Ads_Income,Ads_Income_All,Net_Incomes,Units_sold,ROAS_Ads,CTR,CPC_medio_1w,CPC_medio_2w,CPC_medio_3w,CPC_medio_4w,Impressions_1w,Impressions_2w,Impressions_3w,Impressions_4w,Clics_1w,Clics_2w,Clics_3w,Clics_4w,CTR_1w,CTR_2w,CTR_3w,CTR_4w,Page_Views_1w,Page_Views_2w,Page_Views_3w,Page_Views_4w,Cost_1w,Cost_2w,Cost_3w,Cost_4w,Conversions_1w,Conversions_2w,Conversions_3w,Conversions_4w,All_Conversions_1w,All_Conversions_2w,All_Conversions_3w,All_Conversions_4w,Ads_Income_1w,Ads_Income_2w,Ads_Income_3w,Ads_Income_4w,Ads_Income_All_1w,Ads_Income_All_2w,Ads_Income_All_3w,Ads_Income_All_4w,Net_Incomes_1w,Net_Incomes_2w,Net_Incomes_3w,Net_Incomes_4w,Units_sold_1w,Units_sold_2w,Units_sold_3w,Units_sold_4w
28,123839,decoración para tu casa,espejos de pared,30.0,2018-10-01,0.27,156.0,4.0,0.0,1.08,0.0,0.0,0.0,0.0,37.53,2.0,0.0,0.025641,0.13,0.07,0.04,0.03,156.0,156.0,156.0,156.0,4.0,4.0,4.0,4.0,0.03,0.03,0.03,0.03,0.0,0.0,0.0,0.0,1.08,1.08,1.08,1.08,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,37.53,37.53,37.53,37.53,2.0,2.0,2.0,2.0
29,123839,decoración para tu casa,espejos de pared,30.0,2018-10-02,0.3,73.0,1.0,0.0,0.3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.013699,0.18,0.09,0.06,0.04,114.5,114.5,114.5,114.5,2.5,2.5,2.5,2.5,0.02,0.02,0.02,0.02,0.0,0.0,0.0,0.0,0.69,0.69,0.69,0.69,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,18.76,18.76,18.76,18.76,1.0,1.0,1.0,1.0
30,123839,decoración para tu casa,espejos de pared,30.0,2018-10-03,0.37,137.0,2.0,0.0,0.74,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.014599,0.23,0.12,0.08,0.06,122.0,122.0,122.0,122.0,2.33,2.33,2.33,2.33,0.02,0.02,0.02,0.02,0.0,0.0,0.0,0.0,0.71,0.71,0.71,0.71,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,12.51,12.51,12.51,12.51,0.67,0.67,0.67,0.67
31,123839,decoración para tu casa,espejos de pared,30.0,2018-10-04,0.13,126.0,3.0,0.0,0.4,0.2,0.2,4.3,4.3,17.78,1.0,10.76,0.02381,0.22,0.12,0.08,0.06,123.0,123.0,123.0,123.0,2.5,2.5,2.5,2.5,0.02,0.02,0.02,0.02,0.0,0.0,0.0,0.0,0.63,0.63,0.63,0.63,0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.05,1.08,1.08,1.08,1.08,1.08,1.08,1.08,1.08,13.83,13.83,13.83,13.83,0.75,0.75,0.75,0.75
32,123839,decoración para tu casa,espejos de pared,30.0,2018-10-05,0.27,264.0,3.0,19.0,0.8,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.011364,0.26,0.14,0.1,0.07,151.2,151.2,151.2,151.2,2.6,2.6,2.6,2.6,0.02,0.02,0.02,0.02,3.8,3.8,3.8,3.8,0.66,0.66,0.66,0.66,0.04,0.04,0.04,0.04,0.24,0.24,0.24,0.24,0.86,0.86,0.86,0.86,0.86,0.86,0.86,0.86,11.06,11.06,11.06,11.06,0.6,0.6,0.6,0.6


In [55]:
df_123839.shape

(182, 66)

<a name='save_final'></a>

## Concat and save Final Dataset
Finally, it is:
    - Concatenated all the files created in the previous section
    - Customize the dataframe
    - Save in the right folder

In [56]:
#format extension
extension = 'csv'

#find filenames
all_filenames = [i for i in glob.glob('./data/customize/rolling_dataset/*.{}'.format(extension))]

#combine all files in the list
df = pd.concat([pd.read_csv(f, sep=',', error_bad_lines=True, index_col=0) for f in all_filenames])

Check the data:

In [57]:
df.head()

Unnamed: 0,Reference,CatN1,CatN2,Cat_Price,Dates,CPC_medio,Impressions,Clics,Page_Views,Cost,Conversions,All_Conversions,Ads_Income,Ads_Income_All,Net_Incomes,Units_sold,ROAS_Ads,CTR,CPC_medio_1w,CPC_medio_2w,CPC_medio_3w,CPC_medio_4w,Impressions_1w,Impressions_2w,Impressions_3w,Impressions_4w,Clics_1w,Clics_2w,Clics_3w,Clics_4w,CTR_1w,CTR_2w,CTR_3w,CTR_4w,Page_Views_1w,Page_Views_2w,Page_Views_3w,Page_Views_4w,Cost_1w,Cost_2w,Cost_3w,Cost_4w,Conversions_1w,Conversions_2w,Conversions_3w,Conversions_4w,All_Conversions_1w,All_Conversions_2w,All_Conversions_3w,All_Conversions_4w,Ads_Income_1w,Ads_Income_2w,Ads_Income_3w,Ads_Income_4w,Ads_Income_All_1w,Ads_Income_All_2w,Ads_Income_All_3w,Ads_Income_All_4w,Net_Incomes_1w,Net_Incomes_2w,Net_Incomes_3w,Net_Incomes_4w,Units_sold_1w,Units_sold_2w,Units_sold_3w,Units_sold_4w
28,90953,muebles,"consolas, recibidores y tocadores",300.0,2018-10-01,0.0,0.0,0.0,28.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,28.0,28.0,28.0,28.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
29,90953,muebles,"consolas, recibidores y tocadores",300.0,2018-10-02,0.0,0.0,0.0,19.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,23.5,23.5,23.5,23.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
30,90953,muebles,"consolas, recibidores y tocadores",300.0,2018-10-03,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,15.67,15.67,15.67,15.67,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
31,90953,muebles,"consolas, recibidores y tocadores",300.0,2018-10-04,0.0,0.0,0.0,19.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,16.5,16.5,16.5,16.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
32,90953,muebles,"consolas, recibidores y tocadores",300.0,2018-10-05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,13.2,13.2,13.2,13.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [58]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9098 entries, 28 to 209
Data columns (total 66 columns):
Reference             9098 non-null int64
CatN1                 9098 non-null object
CatN2                 9098 non-null object
Cat_Price             9098 non-null float64
Dates                 9098 non-null object
CPC_medio             9098 non-null float64
Impressions           9098 non-null float64
Clics                 9098 non-null float64
Page_Views            9098 non-null float64
Cost                  9098 non-null float64
Conversions           9098 non-null float64
All_Conversions       9098 non-null float64
Ads_Income            9098 non-null float64
Ads_Income_All        9098 non-null float64
Net_Incomes           9098 non-null float64
Units_sold            9098 non-null float64
ROAS_Ads              9098 non-null float64
CTR                   6343 non-null float64
CPC_medio_1w          9098 non-null float64
CPC_medio_2w          9098 non-null float64
CPC_medio_3w      

In [59]:
df.shape

(9098, 66)

After concatenating the files, it is developed following steps in order to run the study with the rigth data:
    - Remove rows with ROAS_Ads == 0
    - Remove rows with Pageviews == 0
    - Remove rows with Impressions == 0
    - Remove columns "Reference" and "Dates"

In [60]:
#-----ROAS_ADS-------#

#remove rows with ROAS_Ads == 0
df = df[~df["ROAS_Ads"].isin([0, 1])]

#remove rows with Pageviews == 0
df = df[~df["Page_Views"].isin([0, 1])]

#remove rows with Impressions == 0
df = df[~df["Impressions"].isin([0, 1])]

#remove columns "Reference" and "Dates"
#sort columns in the correct order
columns = ['CatN1', 'CatN2', 'Cat_Price', 'CPC_medio', 'CPC_medio_1w', 'CPC_medio_2w', 'CPC_medio_3w', 
           'CPC_medio_4w','Impressions', 'Impressions_1w', 'Impressions_2w', 'Impressions_3w', 
           'Impressions_4w', 'Clics', 'Clics_1w', 'Clics_2w', 'Clics_3w', 'Clics_4w', 'CTR', 'CTR_1w', 
           'CTR_2w', 'CTR_3w', 'CTR_4w', 'Page_Views', 'Page_Views_1w', 'Page_Views_2w', 'Page_Views_3w', 
           'Page_Views_4w', 'Cost', 'Cost_1w', 'Cost_2w', 'Cost_3w', 'Cost_4w', 'Conversions', 'Conversions_1w', 
           'Conversions_2w', 'Conversions_3w', 'Conversions_4w', 'All_Conversions', 'All_Conversions_1w', 
           'All_Conversions_2w', 'All_Conversions_3w', 'All_Conversions_4w', 'Ads_Income', 'Ads_Income_1w', 
           'Ads_Income_2w', 'Ads_Income_3w', 'Ads_Income_4w', 'Ads_Income_All', 'Ads_Income_All_1w', 
           'Ads_Income_All_2w', 'Ads_Income_All_3w', 'Ads_Income_All_4w', 'Net_Incomes', 'Net_Incomes_1w', 
           'Net_Incomes_2w', 'Net_Incomes_3w', 'Net_Incomes_4w', 'Units_sold', 'Units_sold_1w', 'Units_sold_2w', 
           'Units_sold_3w', 'Units_sold_4w', 'ROAS_Ads']
df = df[columns]

#Cat_price to string
df['Cat_Price'] = df['Cat_Price'].apply(str)

Check the data

In [61]:
df.head(5)

Unnamed: 0,CatN1,CatN2,Cat_Price,CPC_medio,CPC_medio_1w,CPC_medio_2w,CPC_medio_3w,CPC_medio_4w,Impressions,Impressions_1w,Impressions_2w,Impressions_3w,Impressions_4w,Clics,Clics_1w,Clics_2w,Clics_3w,Clics_4w,CTR,CTR_1w,CTR_2w,CTR_3w,CTR_4w,Page_Views,Page_Views_1w,Page_Views_2w,Page_Views_3w,Page_Views_4w,Cost,Cost_1w,Cost_2w,Cost_3w,Cost_4w,Conversions,Conversions_1w,Conversions_2w,Conversions_3w,Conversions_4w,All_Conversions,All_Conversions_1w,All_Conversions_2w,All_Conversions_3w,All_Conversions_4w,Ads_Income,Ads_Income_1w,Ads_Income_2w,Ads_Income_3w,Ads_Income_4w,Ads_Income_All,Ads_Income_All_1w,Ads_Income_All_2w,Ads_Income_All_3w,Ads_Income_All_4w,Net_Incomes,Net_Incomes_1w,Net_Incomes_2w,Net_Incomes_3w,Net_Incomes_4w,Units_sold,Units_sold_1w,Units_sold_2w,Units_sold_3w,Units_sold_4w,ROAS_Ads
136,muebles,"consolas, recibidores y tocadores",200.0,0.26,0.06,0.03,0.02,0.02,39.0,36.71,18.93,12.62,9.46,4.0,0.86,0.43,0.29,0.21,0.102564,0.03,0.02,0.02,0.02,37.0,21.29,21.29,17.71,13.61,1.03,0.2,0.1,0.07,0.05,0.16,0.02,0.01,0.01,0.01,0.16,0.02,0.01,0.01,0.01,24.78,3.54,1.77,1.18,0.88,24.78,3.54,1.77,1.18,0.88,0.0,83.1,115.19,84.58,69.28,0.0,0.57,0.79,0.57,0.46,24.06
172,muebles,"consolas, recibidores y tocadores",200.0,0.26,0.36,0.34,0.34,0.31,451.0,278.0,280.79,235.86,206.5,3.0,1.43,1.79,1.67,1.54,0.006652,0.01,0.01,0.01,0.01,9.0,21.29,16.64,15.95,16.61,0.78,0.61,0.65,0.6,0.52,0.31,0.04,0.02,0.01,0.01,0.31,0.04,0.02,0.01,0.01,62.02,8.86,4.43,2.95,2.22,62.02,8.86,4.43,2.95,2.22,0.0,39.74,50.26,33.51,25.13,0.0,0.29,0.36,0.24,0.18,79.51
186,muebles,"consolas, recibidores y tocadores",200.0,0.3,0.46,0.42,0.41,0.41,824.0,739.0,631.86,500.62,454.07,7.0,5.0,5.14,4.14,3.64,0.008495,0.01,0.01,0.01,0.01,19.0,20.0,16.0,14.19,14.64,2.08,2.3,2.21,1.78,1.59,1.0,0.14,0.07,0.06,0.08,1.0,0.14,0.07,0.06,0.08,277.0,39.57,19.79,14.75,18.64,277.0,39.57,19.79,16.11,19.65,0.0,19.13,33.54,32.93,29.0,0.0,0.14,0.29,0.29,0.25,133.17
203,muebles,"consolas, recibidores y tocadores",200.0,0.39,0.33,0.32,0.37,0.39,1275.0,1144.86,1181.0,1043.95,963.54,6.0,5.57,8.71,7.9,7.36,0.004706,0.0,0.01,0.01,0.01,19.0,11.86,8.57,11.05,12.93,2.34,1.83,3.32,3.2,3.01,0.33,0.05,0.02,0.06,0.05,0.33,0.05,0.02,0.06,0.05,4.97,0.71,0.36,13.43,10.07,4.97,0.71,0.36,13.43,10.07,0.0,0.0,10.51,13.38,10.04,0.0,0.0,0.07,0.1,0.07,2.12
29,muebles,"consolas, recibidores y tocadores",200.0,0.29,0.26,0.26,0.27,0.27,3186.0,3512.0,3512.0,3512.0,3512.0,12.0,14.5,14.5,14.5,14.5,0.003766,0.0,0.0,0.0,0.0,9.0,18.5,18.5,18.5,18.5,3.5,3.95,3.95,3.95,3.95,0.33,0.16,0.16,0.16,0.16,0.33,0.66,0.66,0.66,0.66,58.81,29.4,29.4,29.4,29.4,58.81,29.4,29.4,29.4,29.4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,16.8


In [62]:
df.shape

(494, 64)

In [63]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 494 entries, 136 to 204
Data columns (total 64 columns):
CatN1                 494 non-null object
CatN2                 494 non-null object
Cat_Price             494 non-null object
CPC_medio             494 non-null float64
CPC_medio_1w          494 non-null float64
CPC_medio_2w          494 non-null float64
CPC_medio_3w          494 non-null float64
CPC_medio_4w          494 non-null float64
Impressions           494 non-null float64
Impressions_1w        494 non-null float64
Impressions_2w        494 non-null float64
Impressions_3w        494 non-null float64
Impressions_4w        494 non-null float64
Clics                 494 non-null float64
Clics_1w              494 non-null float64
Clics_2w              494 non-null float64
Clics_3w              494 non-null float64
Clics_4w              494 non-null float64
CTR                   494 non-null float64
CTR_1w                494 non-null float64
CTR_2w                494 non-null f

Save the dataset

In [64]:
df.to_pickle('./data/final/data_final_ROAS_ADS.pkl')