# Assignment on Map-Reduce

In the following questions, you will solve real problems with the techniques you have learned so far. You will work with the **Google Play dataset**, which consists of the following files:
1. googleplaystore
2. googleplaystore review

*You can find the data files in the archive attached to this exercise.*

**T** (10pts) Number of applications per Android version (output must be sorted)


> result: 
`<version, count>`

In [85]:
%%file version_count.py
from mrjob.job import MRJob
from mrjob.step import MRStep
from mrjob.protocol import TextValueProtocol


class MRVersionCount(MRJob):
    # Decode each input line to str so it can be split on the '∑' delimiter.
    INPUT_PROTOCOL = TextValueProtocol

    def steps(self):
        return [
            MRStep(mapper=self.mapper_count_version,
                   combiner=self.combiner_count_version,
                   reducer=self.reducer_count_version),
        ]

    def mapper_count_version(self, _, line):
        # The Android version is the last '∑'-separated field of each record.
        version = line.strip().split('∑')[-1]
        # Skip the header row.
        if version != 'Android Ver':
            yield version, 1

    def combiner_count_version(self, version, counts):
        # Pre-aggregate on the mapper side to reduce shuffled data.
        yield version, sum(counts)

    def reducer_count_version(self, version, counts):
        yield version, sum(counts)


if __name__ == '__main__':
    MRVersionCount.run()

Overwriting version_count.py


#### Important note
When using `TextValueProtocol`, each input line is decoded as UTF-8 and reaches the mapper as a `str`.
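To see why the decoding matters for this job, here is a minimal sketch that mimics the step in plain Python, without calling mrjob. The sample record is made up for illustration; only the `∑` delimiter and the "version is the last field" convention come from the job above.

```python
# Hypothetical sample record; the real file has more fields per line.
raw_line = 'Some App∑ART_AND_DESIGN∑4.1 and up\n'.encode('utf-8')

# Roughly what a text protocol hands to the mapper: a decoded str.
line = raw_line.decode('utf-8')
version = line.strip().split('∑')[-1]
print(version)

# Splitting the undecoded bytes with a str delimiter raises TypeError in Python 3.
try:
    raw_line.split('∑')
    bytes_split_ok = True
except TypeError:
    bytes_split_ok = False
print(bytes_split_ok)
```

Since `∑` is a multi-byte character in UTF-8, working on decoded text avoids having to encode the delimiter by hand.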

In [86]:
!python version_count.py googleplaystore.txt

"1.0 and up"	2
"1.5 and up"	20
"1.6 and up"	116
"2.0 and up"	32
"2.0.1 and up"	7
"2.1 and up"	134
"2.2 - 7.1.1"	1
"2.2 and up"	244
"2.3 and up"	652
"2.3.3 and up"	281
"3.0 and up"	241
"3.1 and up"	10
"3.2 and up"	36
"4.0 and up"	1375
"4.0.3 - 7.1.1"	2
"4.0.3 and up"	1501
"4.1 - 7.1.1"	1
"4.1 and up"	2451
"4.2 and up"	394
"4.3 and up"	243
"4.4 and up"	980
"4.4W and up"	12
"5.0 - 6.0"	1
"5.0 - 7.1.1"	1
"5.0 - 8.0"	2
"5.0 and up"	600
"5.1 and up"	24
"6.0 and up"	60
"7.0 - 7.1.1"	1
"7.0 and up"	42
"7.1 and up"	3
"8.0 and up"	6
"NaN"	2
"Varies with device"	1362


No configs found; falling back on auto-configuration
No configs specified for inline runner
Creating temp directory C:\Users\MOHAMM~1\AppData\Local\Temp\version_count.Mohammadreza.20230109.191151.402070
Running step 1 of 1...
job output is in C:\Users\MOHAMM~1\AppData\Local\Temp\version_count.Mohammadreza.20230109.191151.402070\output
Streaming final output from C:\Users\MOHAMM~1\AppData\Local\Temp\version_count.Mohammadreza.20230109.191151.402070\output...
Removing temp directory C:\Users\MOHAMM~1\AppData\Local\Temp\version_count.Mohammadreza.20230109.191151.402070...


In [115]:
%%file version_count_optimized.py
from mrjob.job import MRJob
from mrjob.step import MRStep
from mrjob.protocol import TextValueProtocol


class MRVersionCount(MRJob):
    # Decode each input line to str so it can be split on the '∑' delimiter.
    INPUT_PROTOCOL = TextValueProtocol

    def configure_args(self):
        super().configure_args()
        self.add_passthru_arg(
            '--ignore-words',
            type=str,
            default='',
            help='comma-separated version values to skip, '
                 'e.g. "Android Ver,NaN"')

    def steps(self):
        return [
            MRStep(
                mapper_init=self.mapper_skip_lines,
                mapper=self.mapper_count_version,
                combiner=self.combiner_count_version,
                reducer=self.reducer_count_version),
        ]

    def mapper_skip_lines(self):
        # Parse --ignore-words once per mapper instead of once per line.
        self.ignore_words = self.options.ignore_words.strip().split(',')

    def mapper_count_version(self, _, line):
        # The Android version is the last '∑'-separated field of each record.
        version = line.strip().split('∑')[-1]
        if version not in self.ignore_words:
            yield version, 1

    def combiner_count_version(self, version, counts):
        # Pre-aggregate on the mapper side to reduce shuffled data.
        yield version, sum(counts)

    def reducer_count_version(self, version, counts):
        yield version, sum(counts)


if __name__ == '__main__':
    MRVersionCount.run()

Overwriting version_count_optimized.py


In [106]:
!python version_count_optimized.py \
    googleplaystore.txt \
    --ignore-words="Android Ver,NaN"

"1.0 and up"	2
"1.5 and up"	20
"1.6 and up"	116
"2.0 and up"	32
"2.0.1 and up"	7
"2.1 and up"	134
"2.2 - 7.1.1"	1
"2.2 and up"	244
"2.3 and up"	652
"2.3.3 and up"	281
"3.0 and up"	241
"3.1 and up"	10
"3.2 and up"	36
"4.0 and up"	1375
"4.0.3 - 7.1.1"	2
"4.0.3 and up"	1501
"4.1 - 7.1.1"	1
"4.1 and up"	2451
"4.2 and up"	394
"4.3 and up"	243
"4.4 and up"	980
"4.4W and up"	12
"5.0 - 6.0"	1
"5.0 - 7.1.1"	1
"5.0 - 8.0"	2
"5.0 and up"	600
"5.1 and up"	24
"6.0 and up"	60
"7.0 - 7.1.1"	1
"7.0 and up"	42
"7.1 and up"	3
"8.0 and up"	6
"Varies with device"	1362


No configs found; falling back on auto-configuration
No configs specified for inline runner
Creating temp directory C:\Users\MOHAMM~1\AppData\Local\Temp\version_count_optimized.Mohammadreza.20230109.192430.052762
Running step 1 of 1...
job output is in C:\Users\MOHAMM~1\AppData\Local\Temp\version_count_optimized.Mohammadreza.20230109.192430.052762\output
Streaming final output from C:\Users\MOHAMM~1\AppData\Local\Temp\version_count_optimized.Mohammadreza.20230109.192430.052762\output...
Removing temp directory C:\Users\MOHAMM~1\AppData\Local\Temp\version_count_optimized.Mohammadreza.20230109.192430.052762...


#### mrjob test case

In [23]:
from unittest import TestCase
import unittest
from version_count_optimized import MRVersionCount

class MRVersionCountTestCase(TestCase):
    def test_mapper(self):
        j = MRVersionCount([])
        j.mapper_skip_lines()
        self.assertEqual(next(j.mapper_count_version(None,'NaN')),('NaN',1))

    def test_mapper_ignore_words(self):
        j = MRVersionCount(['--ignore-words=Android Ver,NaN'])
        j.mapper_skip_lines()
        with self.assertRaises(StopIteration):
            next(j.mapper_count_version(None, "NaN"))

if __name__ == "__main__":
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(MRVersionCountTestCase)
    unittest.TextTestRunner().run(suite)

..
----------------------------------------------------------------------
Ran 2 tests in 0.008s

OK



**T** (10pts) The K best applications in every category (K should be specified by the user)


> result:
`<appname,{other fields} >`
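One possible shape for the per-category selection is sketched below in plain Python rather than mrjob. It assumes that "best" means highest rating and that each record has already been parsed into a `(category, app_name, rating)` tuple; both are assumptions, and the function name `top_k_per_category` is hypothetical. The grouping dictionary stands in for the shuffle phase, and `heapq.nlargest` plays the role of the reducer.

```python
import heapq
from collections import defaultdict


def top_k_per_category(records, k):
    """Return the k highest-rated apps per category.

    records: iterable of (category, app_name, rating) tuples (assumed layout).
    """
    grouped = defaultdict(list)  # stands in for the shuffle phase
    for category, app_name, rating in records:
        grouped[category].append((rating, app_name))

    result = {}
    for category in sorted(grouped):
        # Reducer-like step: keep only the k largest (rating, app) pairs,
        # returned in descending order.
        result[category] = heapq.nlargest(k, grouped[category])
    return result


# Made-up sample data for illustration.
sample = [
    ('GAME', 'A', 4.5), ('GAME', 'B', 4.9), ('GAME', 'C', 3.0),
    ('TOOLS', 'D', 4.1), ('TOOLS', 'E', 4.7),
]
top2 = top_k_per_category(sample, 2)
print(top2)
```

In an mrjob job, K could be read from the command line with `add_passthru_arg`, in the same way `--ignore-words` is handled in `version_count_optimized.py`.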

**T** (20pts) Number of applications in every category per Android version (output must be sorted by count)

> result:
`<category, {count, version} >`
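A two-step approach may fit this task: first count each `(category, version)` pair, then regroup by category and order the versions by count. The sketch below shows that logic in plain Python, assuming the records are already reduced to `(category, version)` pairs and that "sorted on count" means descending within each category; the function name is hypothetical.

```python
from collections import Counter, defaultdict


def category_version_counts(pairs):
    """pairs: iterable of (category, version) tuples (assumed layout)."""
    # Step 1: count every (category, version) pair, like a first MRStep.
    pair_counts = Counter(pairs)

    # Step 2: regroup by category, like a second MRStep keyed on category.
    by_category = defaultdict(list)
    for (category, version), count in pair_counts.items():
        by_category[category].append((count, version))

    # Sort each category's versions by count, descending; ties by version.
    return {
        category: sorted(items, key=lambda cv: (-cv[0], cv[1]))
        for category, items in sorted(by_category.items())
    }


# Made-up sample data for illustration.
sample_pairs = [
    ('GAME', '4.1 and up'), ('GAME', '4.1 and up'),
    ('GAME', '5.0 and up'), ('TOOLS', '4.0 and up'),
]
counts_by_category = category_version_counts(sample_pairs)
print(counts_by_category)
```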

**T** (60pts) In the review dataset, which words occur most often for every application? (output must be sorted by count)

> result: 
`<appname, {count, word1, word2} >`

`hint:` use a secondary sort
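The idea behind a secondary sort is to put the sort field into the key, so that the framework's own sort delivers each group's values in the desired order. The following plain-Python sketch emulates that with a composite key `(app, -count, word)`; it assumes reviews arrive as `(app_name, review_text)` pairs and uses naive whitespace tokenization, both of which are assumptions. The function name is hypothetical.

```python
from collections import Counter
from itertools import groupby


def top_words_per_app(reviews):
    """reviews: iterable of (app_name, review_text) pairs (assumed layout)."""
    # Step 1: count every (app, word) pair.
    counts = Counter()
    for app, text in reviews:
        for word in text.lower().split():
            counts[(app, word)] += 1

    # Step 2: build composite keys; sorting them emulates shuffle-and-sort.
    # Negating the count gives descending order within each app.
    keyed = sorted((app, -n, word) for (app, word), n in counts.items())

    # Step 3: group on the natural key (app); rows are already ordered.
    return {
        app: [(-neg_count, word) for _, neg_count, word in rows]
        for app, rows in groupby(keyed, key=lambda row: row[0])
    }


# Made-up sample data for illustration.
sample_reviews = [('A', 'good app good'), ('A', 'bad app'), ('B', 'nice')]
words_by_app = top_words_per_app(sample_reviews)
print(words_by_app)
```

mrjob also exposes a `SORT_VALUES` class attribute on `MRJob` that asks for values to reach the reducer in sorted order, which may be a convenient way to get the same effect inside a job.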