# Analyse Customer Behaviour in a Multi-Category E-Commerce Website
The dataset contains customer behaviour data from a large multi-category e-commerce website. Customer behaviour is captured in the `event_type` field, which is one of `view`, `cart` or `purchase`. Each row in the file represents one event, and every event links a product to a user, so the events form a many-to-many relation between products and users. This exercise uses the October 2019 dataset published on Kaggle: https://www.kaggle.com/datasets/mkechinov/ecommerce-behavior-data-from-multi-category-store. The data was originally collected by Open CDP: https://rees46.com/en/open-cdp.
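
To make the row layout concrete, here is a minimal sketch that parses one event line into named fields. The column order is an assumption taken from the Kaggle dataset description; it agrees with the indices used by the jobs in this notebook (`row[1]` for the event type, `row[5]` for the brand, `row[6]` for the price). The `COLUMNS` constant, the `parse_event` helper and the sample row are made up for illustration.

```python
import csv

# Assumed column order of the event file (hypothetical constant).
COLUMNS = [
    "event_time", "event_type", "product_id", "category_id",
    "category_code", "brand", "price", "user_id", "user_session",
]

def parse_event(line):
    """Parse one CSV line into a dict keyed by column name."""
    fields = next(csv.reader([line]))
    return dict(zip(COLUMNS, fields))

# Made-up row for illustration only.
sample = ("2019-10-01 00:00:00 UTC,view,1001,2001,"
          "electronics.smartphone,apple,999.99,42,abc-123")
event = parse_event(sample)
print(event["event_type"], event["brand"], float(event["price"]))
```

All fields arrive as strings, which is why the jobs convert the price with `float()` before summing.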

The selected dataset is approximately 5 GB, which makes it difficult to process in our usual RStudio or Colab environment. We therefore move to a big data technology: in this exercise we run MapReduce jobs on Hadoop in a cloud environment. MapReduce programs are natively written in Java. However, the Python `mrjob` package provides an interface to Hadoop Streaming, which lets us write MapReduce programs in Python. Another benefit of `mrjob` is that a single program can chain multiple mappers and reducers as steps.
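
Before turning to `mrjob`, it may help to see the map, shuffle and reduce phases in plain Python. The sketch below is not how Hadoop or `mrjob` is implemented; it only imitates the data flow in a single process. The `run_local` helper and the sample rows are hypothetical.

```python
from collections import defaultdict

def mapper(line):
    """Emit (event_type, price) for one CSV line; skip unparsable prices."""
    row = line.split(",")
    try:
        price = float(row[6])
    except (ValueError, IndexError):
        return  # header row or malformed line
    yield row[1], price

def reducer(key, values):
    """Sum all prices that share the same key."""
    yield key, sum(values)

def run_local(lines):
    """Simulate map -> shuffle (group by key) -> reduce in one process."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    results = {}
    for key in sorted(groups):
        for out_key, out_value in reducer(key, groups[key]):
            results[out_key] = out_value
    return results

# Made-up rows for illustration only.
lines = [
    "event_time,event_type,product_id,category_id,category_code,brand,price,user_id,user_session",
    "2019-10-01 00:00:00 UTC,view,1,10,electronics.smartphone,apple,100.0,7,s1",
    "2019-10-01 00:00:05 UTC,view,2,10,electronics.smartphone,samsung,50.0,8,s2",
    "2019-10-01 00:00:09 UTC,purchase,1,10,electronics.smartphone,apple,100.0,7,s1",
]
totals = run_local(lines)
print(totals)
```

On a cluster the grouping step is performed by the framework between the map and reduce phases, so the job author only writes the mapper and the reducer.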

## 1. Install `mrjob`

In [1]:
! pip install mrjob



## Exercise 01
Using the Python `mrjob` package, write MapReduce programs to produce the following results.

1. Total value of customer behaviour by event type
2. Top 10 brands purchased by value
3. Top 10 brands purchased by volume

In [2]:
%%file ~/customer_behaviour/mrjobs/value_by_event_type.py
from mrjob.job import MRJob

class MapperReducer(MRJob):

    def mapper(self, _, line):
        row = line.split(',')
        event_type = row[1]      
        try:
            price = float(row[6])
        except ValueError:
            pass  # skip the header row and rows with a non-numeric price
        else:
            yield(event_type,price)
        
    def reducer(self, event_type, value):
        yield(event_type, sum(value))

if __name__ == '__main__':
    MapperReducer.run()

Writing /home/jovyan/customer_behaviour/mrjobs/value_by_event_type.py


In [3]:
! python ~/customer_behaviour/mrjobs/value_by_event_type.py sample.csv > ~/customer_behaviour/mrjobs/output/output_1_1.txt

No configs found; falling back on auto-configuration
No configs specified for inline runner
Creating temp directory /tmp/value_by_event_type.jovyan.20221220.135454.600920
Running step 1 of 1...
job output is in /tmp/value_by_event_type.jovyan.20221220.135454.600920/output
Streaming final output from /tmp/value_by_event_type.jovyan.20221220.135454.600920/output...
Removing temp directory /tmp/value_by_event_type.jovyan.20221220.135454.600920...


In [4]:
%%file ~/customer_behaviour/mrjobs/top_brands_value.py
from mrjob.job import MRJob
from mrjob.step import MRStep

class MapperReducer(MRJob):
    SORT_VALUES = True  # deliver each key's values to the reducer in sorted order
    def steps(self):
        return [
            MRStep(
                mapper=self.mapper, 
                reducer=self.reducer_sum
            )
            ,
            MRStep(
                mapper=self.mapper_sort
                ,reducer=self.reducer_sort
            )
        ]
    def mapper(self, _, line):
        row = line.split(',')
        brand = row[5]      
        try:
            price = float(row[6])
        except ValueError:
            pass  # skip the header row and rows with a non-numeric price
        else:
            yield brand, price        
            
    def reducer_sum(self, brand, price):
        yield brand, sum(price)
        
    def mapper_sort(self, brand, total):
        yield None, ("%9.02f"%(float(total)), brand)
        
    def reducer_sort(self, n, brand_value):
        for c in brand_value:
            yield c[0], c[1]

if __name__ == '__main__':
    MapperReducer.run()

Writing /home/jovyan/customer_behaviour/mrjobs/top_brands_value.py


In [5]:
! python ~/customer_behaviour/mrjobs/top_brands_value.py sample.csv > ~/customer_behaviour/mrjobs/output/output_1_2.txt

No configs found; falling back on auto-configuration
No configs specified for inline runner
Creating temp directory /tmp/top_brands_value.jovyan.20221220.135456.326718
Running step 1 of 2...
Running step 2 of 2...
job output is in /tmp/top_brands_value.jovyan.20221220.135456.326718/output
Streaming final output from /tmp/top_brands_value.jovyan.20221220.135456.326718/output...
Removing temp directory /tmp/top_brands_value.jovyan.20221220.135456.326718...
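
The second step of the job above formats each total with `"%9.02f"` before emitting it. A likely reason (this is an interpretation, not stated in the notebook) is that with `SORT_VALUES` the values are ordered by their encoded text, so right-aligned, fixed-width strings are needed for text order to match numeric order. The brands and totals below are made up.

```python
# Made-up totals for illustration only.
totals = [("apple", 1200.5), ("samsung", 98.0), ("xiaomi", 10450.25)]

# Unpadded strings compare character by character, so "98.00" sorts last.
unpadded = sorted("%.02f" % value for _, value in totals)

# Right-aligned to a fixed width of 9, text order matches numeric order
# (for non-negative values that fit within the width).
padded = sorted(("%9.02f" % value, brand) for brand, value in totals)

print(unpadded)
print(padded)
```

Totals wider than the chosen field would break the ordering again, which may be why the category jobs later in the notebook use a wider format such as `"%12.02f"`.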


In [6]:
%%file ~/customer_behaviour/mrjobs/top_brands_volume.py
from mrjob.job import MRJob
from mrjob.step import MRStep

class MapperReducer(MRJob):
    SORT_VALUES = True  # deliver each key's values to the reducer in sorted order
    def steps(self):
        return [
            MRStep(
                mapper=self.mapper, 
                reducer=self.reducer_sum
            )
            ,
            MRStep(
                mapper=self.mapper_sort
                ,reducer=self.reducer_sort
            )
        ]
    def mapper(self, _, line):
        row = line.split(',')
        brand = row[5]      
        try:
            price = float(row[6])
        except ValueError:
            pass  # skip the header row and rows with a non-numeric price
        else:
            yield brand, 1        
            
    def reducer_sum(self, brand, count):
        yield brand, sum(count)
        
    def mapper_sort(self, brand, count):
        yield None, ("%9.02f"%(float(count)), brand)
        
    def reducer_sort(self, n, brand_volume):
        for c in brand_volume:
            yield c[0], c[1]

if __name__ == '__main__':
    MapperReducer.run()

Writing /home/jovyan/customer_behaviour/mrjobs/top_brands_volume.py


In [7]:
! python ~/customer_behaviour/mrjobs/top_brands_volume.py sample.csv > ~/customer_behaviour/mrjobs/output/output_1_3.txt

No configs found; falling back on auto-configuration
No configs specified for inline runner
Creating temp directory /tmp/top_brands_volume.jovyan.20221220.135458.244386
Running step 1 of 2...
Running step 2 of 2...
job output is in /tmp/top_brands_volume.jovyan.20221220.135458.244386/output
Streaming final output from /tmp/top_brands_volume.jovyan.20221220.135458.244386/output...
Removing temp directory /tmp/top_brands_volume.jovyan.20221220.135458.244386...
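
The two brand jobs above emit every brand rather than only ten, and they do not filter on `event_type`. One way to restrict the result to the top 10 purchased brands is sketched below in plain Python with `heapq.nlargest`. The `top_brands` helper and the sample rows are hypothetical, and dropping rows with an empty brand is an assumption about what the exercise intends.

```python
import heapq
from collections import defaultdict

def top_brands(lines, n=10):
    """Return (by_value, by_volume): the top-n purchased brands."""
    value = defaultdict(float)
    volume = defaultdict(int)
    for line in lines:
        row = line.split(",")
        try:
            event_type, brand, price = row[1], row[5], float(row[6])
        except (IndexError, ValueError):
            continue  # header row or malformed line
        if event_type != "purchase" or not brand:
            continue  # keep purchases with a known brand only
        value[brand] += price
        volume[brand] += 1
    by_value = heapq.nlargest(n, value.items(), key=lambda kv: kv[1])
    by_volume = heapq.nlargest(n, volume.items(), key=lambda kv: kv[1])
    return by_value, by_volume

# Made-up rows for illustration only.
rows = [
    "event_time,event_type,product_id,category_id,category_code,brand,price,user_id,user_session",
    "2019-10-01 00:00:00 UTC,purchase,1,10,electronics.smartphone,apple,100.0,7,s1",
    "2019-10-01 00:01:00 UTC,purchase,1,10,electronics.smartphone,apple,100.0,8,s2",
    "2019-10-01 00:02:00 UTC,purchase,2,10,electronics.smartphone,samsung,150.0,9,s3",
    "2019-10-01 00:03:00 UTC,view,3,10,electronics.smartphone,xiaomi,999.0,9,s3",
    "2019-10-01 00:04:00 UTC,purchase,4,11,,,5.0,9,s3",
]
by_value, by_volume = top_brands(rows, n=2)
print(by_value)
print(by_volume)
```

Inside an `mrjob` job, the same `heapq.nlargest` call could be applied in the final reducer, where all `(total, brand)` pairs arrive under a single key.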


## Exercise 02

The `category_code` variable holds the product category and its sub-categories, delimited by a `.`. Using string manipulation, extract the following query results.

1. Highest value category
2. Top product categories by value
3. Top product categories by volume
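
The string manipulation needed here is a split on the first `.`. The sketch below shows it in isolation. The `top_level_category` helper is hypothetical, and mapping an empty code to `"unknown"` is an assumption; the jobs in this notebook keep the empty string as the key instead.

```python
def top_level_category(category_code):
    """Return the first dot-separated component, or 'unknown' if empty."""
    head = category_code.strip().split(".")[0]
    return head or "unknown"

print(top_level_category("electronics.smartphone"))
print(top_level_category("appliances.kitchen.washer"))
print(top_level_category(""))
```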

In [8]:
%%file ~/customer_behaviour/mrjobs/max_category.py
from mrjob.job import MRJob
from mrjob.step import MRStep

class MapperReducer(MRJob):
    
    def steps(self):
        return [
            MRStep(
                mapper=self.mapper 
                ,reducer=self.reducer_sum
            )
            ,
            MRStep(
                reducer=self.reducer_max
            )
        ]
    def mapper(self, _, line):
        row = line.split(',')
        category = row[4].split('.')[0]      
        try:
            price = float(row[6])
        except ValueError:
            pass  # skip the header row and rows with a non-numeric price
        else:
            yield category, price        
            
    def reducer_sum(self, category, value):
        yield None, (sum(value), category)
        
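    def reducer_max(self, _, values):
        # Each value is a (total, category) pair; max() compares totals first.
        total, category = max(values)
        yield total, category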
        

if __name__ == '__main__':
    MapperReducer.run()

Writing /home/jovyan/customer_behaviour/mrjobs/max_category.py


In [9]:
! python ~/customer_behaviour/mrjobs/max_category.py sample.csv > ~/customer_behaviour/mrjobs/output/output_2_1.txt

No configs found; falling back on auto-configuration
No configs specified for inline runner
Creating temp directory /tmp/max_category.jovyan.20221220.135500.132179
Running step 1 of 2...
Running step 2 of 2...
job output is in /tmp/max_category.jovyan.20221220.135500.132179/output
Streaming final output from /tmp/max_category.jovyan.20221220.135500.132179/output...
Removing temp directory /tmp/max_category.jovyan.20221220.135500.132179...


In [10]:
%%file ~/customer_behaviour/mrjobs/product_category_value.py
from mrjob.job import MRJob
from mrjob.step import MRStep

class MapperReducer(MRJob):
    SORT_VALUES = True  # deliver each key's values to the reducer in sorted order
    def steps(self):
        return [
            MRStep(
                mapper=self.mapper 
                ,reducer=self.reducer_sum
            )
            ,
            MRStep(
                mapper=self.mapper_sort
                ,reducer=self.reducer_sort
            )
        ]
    def mapper(self, _, line):
        row = line.split(',')
        category = row[4].split('.')[0]      
        try:
            price = float(row[6])
        except ValueError:
            pass  # skip the header row and rows with a non-numeric price
        else:
            yield category, price        
            
    def reducer_sum(self, category, price):
        yield category, sum(price)
        
    def mapper_sort(self, category, value):
        yield None, ("%12.02f"%(float(value)), category)
        
    def reducer_sort(self, n, cat_value):
        for c in cat_value:
            yield c[0], c[1]

if __name__ == '__main__':
    MapperReducer.run()

Writing /home/jovyan/customer_behaviour/mrjobs/product_category_value.py


In [11]:
! python ~/customer_behaviour/mrjobs/product_category_value.py sample.csv > ~/customer_behaviour/mrjobs/output/output_2_2.txt

No configs found; falling back on auto-configuration
No configs specified for inline runner
Creating temp directory /tmp/product_category_value.jovyan.20221220.135501.980377
Running step 1 of 2...
Running step 2 of 2...
job output is in /tmp/product_category_value.jovyan.20221220.135501.980377/output
Streaming final output from /tmp/product_category_value.jovyan.20221220.135501.980377/output...
Removing temp directory /tmp/product_category_value.jovyan.20221220.135501.980377...


In [12]:
%%file ~/customer_behaviour/mrjobs/product_category_volume.py
from mrjob.job import MRJob
from mrjob.step import MRStep

class MapperReducer(MRJob):
    SORT_VALUES = True  # deliver each key's values to the reducer in sorted order
    def steps(self):
        return [
            MRStep(
                mapper=self.mapper 
                ,reducer=self.reducer_sum
            )
            ,
            MRStep(
                mapper=self.mapper_sort
                ,reducer=self.reducer_sort
            )
        ]
    def mapper(self, _, line):
        row = line.split(',')
        category = row[4].split('.')[0]      
        try:
            price = float(row[6])
        except ValueError:
            pass  # skip the header row and rows with a non-numeric price
        else:
            yield category, 1        
            
    def reducer_sum(self, category, count):
        yield category, sum(count)
        
    def mapper_sort(self, category, count):
        yield None, ("%7.02f"%(float(count)), category)
        
    def reducer_sort(self, n, cat_volume):
        for c in cat_volume:
            yield c[0], c[1]

if __name__ == '__main__':
    MapperReducer.run()

Writing /home/jovyan/customer_behaviour/mrjobs/product_category_volume.py


In [13]:
! python ~/customer_behaviour/mrjobs/product_category_volume.py sample.csv > ~/customer_behaviour/mrjobs/output/output_2_3.txt

No configs found; falling back on auto-configuration
No configs specified for inline runner
Creating temp directory /tmp/product_category_volume.jovyan.20221220.135503.769489
Running step 1 of 2...
Running step 2 of 2...
job output is in /tmp/product_category_volume.jovyan.20221220.135503.769489/output
Streaming final output from /tmp/product_category_volume.jovyan.20221220.135503.769489/output...
Removing temp directory /tmp/product_category_volume.jovyan.20221220.135503.769489...


## Exercise 03
The dataset includes items viewed by the users. This can be identified using the `event_type`. The company wants to analyse the daily view pattern during the month. First we need to generate the date from the timestamp value. Then the company requires us to generate the following query results.

1. Create daily view pattern of apple products
2. Visualise the timeseries using a line chart
3. Compare the view patterns of apple vs samsung
4. Compare the two frequency distributions
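
Generating the date from the timestamp can be done by keeping the part before the first space. The sketch below assumes timestamps of the form `2019-10-01 00:00:00 UTC`; the `event_date` helper is hypothetical, and returning `None` for values that are not dates (such as the header row) is a design choice made for this example.

```python
from datetime import datetime

def event_date(event_time):
    """Return the YYYY-MM-DD date of a timestamp string, or None if invalid."""
    parts = event_time.split()
    if not parts:
        return None
    try:
        # Validate the date part; the header row fails here.
        datetime.strptime(parts[0], "%Y-%m-%d")
    except ValueError:
        return None
    return parts[0]

print(event_date("2019-10-01 00:00:00 UTC"))
print(event_date("event_time"))
```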

In [14]:
%%file ~/customer_behaviour/mrjobs/apple_views.py
from mrjob.job import MRJob
from mrjob.step import MRStep

class MapperReducer(MRJob):
    SORT_VALUES = True  # deliver each key's values to the reducer in sorted order
    def steps(self):
        return [
            MRStep(
                mapper=self.mapper 
                ,reducer=self.reducer
            )
        ]
    def mapper(self, _, line):
        row = line.split(',')
        brand = row[5]
        date = row[0].split()[0]
        if brand == 'apple' and row[1] == 'view':  # count view events only
            yield date, 1        
            
    def reducer(self, date, count):
        yield date, sum(count)

if __name__ == '__main__':
    MapperReducer.run()

Writing /home/jovyan/customer_behaviour/mrjobs/apple_views.py


In [15]:
! python ~/customer_behaviour/mrjobs/apple_views.py sample.csv > ~/customer_behaviour/mrjobs/output/output_3_1.txt

No configs found; falling back on auto-configuration
No configs specified for inline runner
Creating temp directory /tmp/apple_views.jovyan.20221220.135505.584545
Running step 1 of 1...
job output is in /tmp/apple_views.jovyan.20221220.135505.584545/output
Streaming final output from /tmp/apple_views.jovyan.20221220.135505.584545/output...
Removing temp directory /tmp/apple_views.jovyan.20221220.135505.584545...


In [16]:
%%file ~/customer_behaviour/mrjobs/samsung_views.py
from mrjob.job import MRJob
from mrjob.step import MRStep

class MapperReducer(MRJob):
    SORT_VALUES = True  # deliver each key's values to the reducer in sorted order
    def steps(self):
        return [
            MRStep(
                mapper=self.mapper 
                ,reducer=self.reducer
            )
        ]
    def mapper(self, _, line):
        row = line.split(',')
        brand = row[5]
        date = row[0].split()[0]
        if brand == 'samsung' and row[1] == 'view':  # count view events only
            yield date, 1        
            
    def reducer(self, date, count):
        yield date, sum(count)

if __name__ == '__main__':
    MapperReducer.run()

Writing /home/jovyan/customer_behaviour/mrjobs/samsung_views.py


In [17]:
! python ~/customer_behaviour/mrjobs/samsung_views.py sample.csv > ~/customer_behaviour/mrjobs/output/output_3_2.txt

No configs found; falling back on auto-configuration
No configs specified for inline runner
Creating temp directory /tmp/samsung_views.jovyan.20221220.135507.310499
Running step 1 of 1...
job output is in /tmp/samsung_views.jovyan.20221220.135507.310499/output
Streaming final output from /tmp/samsung_views.jovyan.20221220.135507.310499/output...
Removing temp directory /tmp/samsung_views.jovyan.20221220.135507.310499...
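
To compare the two brands, the job outputs first have to be read back. By default `mrjob` writes one line per result, with a JSON-encoded key and a JSON-encoded value separated by a tab. The sketch below parses that format and places the two series side by side. The `parse_mrjob_output` helper and the counts are made up; real input would be the lines of `output_3_1.txt` and `output_3_2.txt`.

```python
import json
from statistics import mean

def parse_mrjob_output(lines):
    """Parse 'json_key<TAB>json_value' lines into a {date: count} dict."""
    series = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        key, value = line.split("\t", 1)
        series[json.loads(key)] = json.loads(value)
    return series

# Made-up counts for illustration only.
apple = parse_mrjob_output(['"2019-10-01"\t120', '"2019-10-02"\t150'])
samsung = parse_mrjob_output(['"2019-10-01"\t200', '"2019-10-02"\t180'])

for date in sorted(apple):
    print(date, apple[date], samsung.get(date, 0))
print("mean daily views:", mean(apple.values()), mean(samsung.values()))
```

The date-sorted counts can then be handed to a plotting library of choice to draw the line chart and to compare the two frequency distributions.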


## Exercise 04 

The `suppliers.txt` file lists 20 brands and their supplier codes, which management wants to analyse further. Using the `mrjob` package, create a MapReduce program that joins `suppliers.txt` with `sample.csv` and calculates the total value per supplier.
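
The idea of a reduce-side join is to tag each record with its source, group both sources by the join key, and pair them up once they meet under the same key. The plain-Python sketch below imitates that flow in one process. It assumes `suppliers.txt` holds `brand,supplier_code` lines, as the two-field branch of the mapper in the next cell suggests; the `join_and_sum` helper, the supplier codes and the rows are made up.

```python
from collections import defaultdict

def join_and_sum(event_lines, supplier_lines):
    """Inner-join events to suppliers on brand, then total price per supplier."""
    tagged = defaultdict(list)  # brand -> [(tag, value), ...]
    for line in supplier_lines:
        fields = line.strip().split(",")
        if len(fields) == 2:
            tagged[fields[0]].append(("M", fields[1]))  # master record
    for line in event_lines:
        fields = line.strip().split(",")
        if len(fields) != 9:
            continue
        try:
            tagged[fields[5]].append(("T", float(fields[6])))  # transaction
        except ValueError:
            continue  # header row or non-numeric price
    totals = defaultdict(float)
    for brand, values in tagged.items():
        suppliers = [v for tag, v in values if tag == "M"]
        prices = [v for tag, v in values if tag == "T"]
        if suppliers and prices:  # inner join: both sides must be present
            totals[suppliers[0]] += sum(prices)
    return dict(totals)

# Made-up data for illustration only.
suppliers = ["apple,S01", "samsung,S02", "sony,S01"]
events = [
    "event_time,event_type,product_id,category_id,category_code,brand,price,user_id,user_session",
    "2019-10-01 00:00:00 UTC,view,1,10,electronics.smartphone,apple,100.0,7,s1",
    "2019-10-01 00:00:05 UTC,view,2,10,electronics.smartphone,samsung,50.0,8,s2",
    "2019-10-01 00:00:09 UTC,view,3,10,electronics.audio,sony,25.0,9,s3",
    "2019-10-01 00:00:12 UTC,view,4,10,electronics.smartphone,xiaomi,10.0,9,s3",
]
supplier_totals = join_and_sum(events, suppliers)
print(supplier_totals)
```

Brands without a supplier entry (here `xiaomi`) drop out of the result, which is the behaviour of an inner join.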

In [18]:
%%file ~/customer_behaviour/mrjobs/reduce_join.py
from mrjob.job import MRJob
from mrjob.step import MRStep

class InnerJoin(MRJob):

    def mapper(self, _, line):
        fields=line.split(',')
        if len(fields) == 9:
            join_key = fields[5]
            try:
                join_value = float(fields[6])
            except ValueError:
                pass  # skip the header row and rows with a non-numeric price
            else:
                yield (join_key, ('T', join_value))
            
        elif len(fields) == 2: 
            join_key  = fields[0]
            join_value = fields[1]
            yield (join_key, ('M', join_value))
            
        else:
            pass
        
    def reducer_join(self, key, values):
        master_tuples = []
        transactions_tuples = []

        for value in values:
            relation_symbol = value[0]
            if relation_symbol == 'M': 
                master_tuples.append(value[1])
            elif relation_symbol == 'T':
                transactions_tuples.append(value[1])
            else:
                pass
            
        if len(master_tuples) > 0 and len(transactions_tuples) > 0:
            for value in transactions_tuples:
                yield (master_tuples[0], value)
    
    def reducer_sum(self, supplier, value):
        yield(supplier, sum(value))
        
    def steps(self):
        return [
            MRStep(mapper=self.mapper
                   ,reducer=self.reducer_join)
            ,MRStep(reducer=self.reducer_sum)
        ]
if __name__ == '__main__':
    InnerJoin.run()

Writing /home/jovyan/customer_behaviour/mrjobs/reduce_join.py


In [19]:
! python ~/customer_behaviour/mrjobs/reduce_join.py sample.csv suppliers.txt > ~/customer_behaviour/mrjobs/output/output_4_1.txt

No configs found; falling back on auto-configuration
No configs specified for inline runner
Creating temp directory /tmp/reduce_join.jovyan.20221220.135508.939617
Running step 1 of 2...
Running step 2 of 2...
job output is in /tmp/reduce_join.jovyan.20221220.135508.939617/output
Streaming final output from /tmp/reduce_join.jovyan.20221220.135508.939617/output...
Removing temp directory /tmp/reduce_join.jovyan.20221220.135508.939617...
