<b>This notebook is just a tutorial to help you get familiar with skip-gram and MapReduce.  
<font color="red">You don't need to hand in this notebook</font>, so feel free to jump to the [Requirement section](#Assignment-Requirement) and work directly on your `mapper.py` and `reducer.py` if you already know how to do so.</b>  

# Week 05: Skip-gram and MapReduce

In previous assignments, you learned the concept of n-grams and how to generate them.  
This week, we introduce another kind of gram, called *skip-gram*.  
We will also compute it on a large dataset, so you'll have to process it with the MapReduce technique.  

So, first things first: what is a skip-gram?  

## Skip-gram

<i>\[S\]kip-grams are a generalization of n-grams in which the components (typically words) need not be consecutive in the text under consideration, but may leave gaps that are skipped over.  - from [Wikipedia](https://en.wikipedia.org/wiki/N-gram#Skip-gram)</i>  

That is, a skip-gram is the same as an n-gram, except that it is allowed to skip some words in between.  
In the sentence <i>"Strong winds blew roofs away"</i>, two of its bigrams are <i>"winds blew"</i> and <i>"blew roofs"</i>, while <i>"blew away"</i> is one of its skip-grams with distance 2, since it skips one word, <i>"roofs"</i>.  
As you can see, a skip-gram can capture a phrase whose words are separated by other words.  

Now consider another sentence:

> "Skip-gram is used to predict the context word for a given target word".

With the pivot word *predict*, all of its skip-grams within distance 5 are shown below:
```
.------------------------------------------------------------------------------.
| distance || -5 |     -4    | -3 |  -2  | -1 |  1  |    2    |  3   |  4  | 5 |
|----------||----|-----------|----|------|----|-----|---------|------|-----|---|
| predict  || -  | Skip-gram | is | used | to | the | context | word | for | a |
'------------------------------------------------------------------------------'
```
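
The row above can be reproduced with a few lines of Python. This is only a minimal sketch, assuming simple whitespace tokenization; `skipgrams_for_pivot` is a hypothetical helper name chosen for illustration, not something the assignment requires.

```python
def skipgrams_for_pivot(tokens, pivot_index, max_distance):
    """Return (pivot, word, distance) triples for one pivot position."""
    pivot = tokens[pivot_index]
    triples = []
    for distance in range(-max_distance, max_distance + 1):
        if distance == 0:
            continue  # distance 0 is the pivot itself
        neighbor_index = pivot_index + distance
        # Skip positions that fall outside the sentence
        if 0 <= neighbor_index < len(tokens):
            triples.append((pivot, tokens[neighbor_index], distance))
    return triples


sentence = "Skip-gram is used to predict the context word for a given target word"
tokens = sentence.split()
for triple in skipgrams_for_pivot(tokens, tokens.index("predict"), 5):
    print(triple)
```

Positions that fall outside the sentence (distance -5 here) simply produce no triple, which corresponds to the `-` cell in the table.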

<a name="Practice"></a>
### Practice: Distance table of skip-gram

Now, let's practice!  
Given the sentence <i>"Skip-gram is used to predict the context word for a given target word"</i>, **<u>output all of its skip-grams with distances from -3 to 3</u> and show the result in a table**.  

**Example**
```
distance      -3            -2            -1            1             2             3             
--------------------------------------------------------------------------------------------
Skip-gram     -             -             -             is            used          to            
is            -             -             Skip-gram     used          to            predict       
used          -             Skip-gram     is            to            predict       the           
to            Skip-gram     is            used          predict       the           context       
predict       is            used          to            the           context       word          
the           used          to            predict       context       word          for           
context       to            predict       the           word          for           a             
word          predict       the           context       for           a             given         
for           the           context       word          a             given         target        
a             context       word          for           given         target        word          
given         word          for           a             target        word          -             
target        for           a             given         word          -             -             
word          a             given         target        -             -             -             
```

\*Hint: Try to get the skip-grams for a single word first if you have trouble generating them all at once. 
```
(predict, is, -3)
(predict, used, -2)
(predict, to, -1)
(predict, the, 1)
(predict, context, 2)
(predict, word, 3)
```

In [17]:
tokens = "Skip-gram is used to predict the context word for a given target word".split()
token_length = len(tokens)
dis_range = 3
col_width = 15

# Header row: distances -3..-1 and 1..3 (distance 0 is the pivot itself)
distances = [d for d in range(-dis_range, dis_range + 1) if d != 0]
print("distance".ljust(col_width) + "".join(str(d).ljust(col_width) for d in distances))

# One row per pivot; "-" marks positions that fall outside the sentence
for idx in range(token_length):
    row = tokens[idx].ljust(col_width)
    for d in distances:
        pos = idx + d
        row += (tokens[pos] if 0 <= pos < token_length else "-").ljust(col_width)
    print(row)

distance       -3             -2             -1             1              2              3              
Skip-gram      -              -              -              is             used           to             
is             -              -              Skip-gram      used           to             predict        
used           -              Skip-gram      is             to             predict        the            
to             Skip-gram      is             used           predict        the            context        
predict        is             used           to             the            context        word           
the            used           to             predict        context        word           for            
context        to             predict        the            word           for            a              
word           predict        the            context        for            a              given          
for            the            context        w

## MapReduce

<i>MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster. - from [Wikipedia](https://en.wikipedia.org/wiki/MapReduce)</i> 

### Why MapReduce?

Imagine that you are working on a very large dataset, say all pages on Wikipedia (whose size had already reached 94GB in 2013).  
Most likely you cannot process the whole corpus in memory or on a single computer. Even a simple frequency counter becomes challenging at this scale.  
To deal with this, Google proposed a big-data processing model called MapReduce, which has been implemented and supported by many distributed computing systems, such as Apache Hadoop.  
The core concept of MapReduce is to **split, apply and then combine**, so that each data segment can be handled separately.  

### Mapper-Shuffler-Reducer

![](https://www.todaysoftmag.com/images/articles/tsm33/large/a11.png)
<small><i> - image source: [Today Software Magazine](https://www.todaysoftmag.com/article/1358/hadoop-mapreduce-deep-diving-and-tuning)</i></small> 

As you can see in the picture:  
First, the whole dataset is split into smaller partitions, each of which can be processed by an independent machine.  
In this step, **mappers** generate one or more key-value pairs that can easily be grouped.  
 - example: in a word counter, a mapper would emit each word together with its current count.  

Then, we **shuffle** and group all outputs from the mappers.  
 - example: sort the output from the mappers.  

Lastly, we combine the grouped values and **reduce** them into the final results.  
 - example: calculate the total frequency in each group.  
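
To make the three stages concrete, here is a minimal single-machine sketch of the word-counter example. The function names (`map_words`, `shuffle`, `reduce_counts`) and the toy partitions are assumptions made for illustration; a real framework such as Hadoop would run the mappers and reducers on separate machines.

```python
from itertools import groupby
from operator import itemgetter


def map_words(partition):
    """Mapper: emit a (word, 1) pair for every word in one partition."""
    for line in partition:
        for word in line.lower().split():
            yield (word, 1)


def shuffle(pairs):
    """Shuffler: sort the pairs so that identical keys become adjacent."""
    return sorted(pairs, key=itemgetter(0))


def reduce_counts(sorted_pairs):
    """Reducer: sum the counts of each run of identical keys."""
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield (word, sum(count for _, count in group))


# Two toy partitions; each could be mapped on a different machine
partitions = [["the cat sat"], ["the dog sat", "the end"]]
mapped = [pair for part in partitions for pair in map_words(part)]
word_counts = dict(reduce_counts(shuffle(mapped)))
print(word_counts)
```

Note that the reducer relies on the input being sorted: `groupby` only merges keys that are adjacent.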

## MapReduce for skip-gram

Now, with the concepts of skip-gram and MapReduce in mind, it's time to put them together: let's generate a skip-gram table with the MapReduce technique!  

It may sound scary to some of you, so let's break it down first.  
There are 3 steps, each described below:
1. **Mapper**: Print every skip-gram with its distance information and its current count.  
   ```
   a b -3 1
   a c 3  1
   c e -2 1
   a c 1  1
   a b -3 1
   b d 2  1
   ```
2. **Shuffler**: Group all skip-grams by their text. This is easily achieved by sorting.  
   ```
   a b -3 1
   a b -3 1
   a c 1  1
   a c 3  1
   b d 2  1
   c e -2 1
   ```
3. **Reducer**:  
   Since the results were sorted in the previous step, we can easily calculate the frequency of each skip-gram at each distance.  
   Here, the frequency of the skip-gram `a b` with distance $-3$ is $1+1=2$, while every other skip-gram has frequency $1$.
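
The three steps can be traced on the toy records above. This is a small in-memory sketch for illustration only; the tuple layout `(pivot, word, distance, count)` is an assumption that mirrors the four columns shown in the listings.

```python
from itertools import groupby

# Mapper output: (pivot, word, distance, count), in the order emitted
mapper_output = [
    ("a", "b", -3, 1),
    ("a", "c", 3, 1),
    ("c", "e", -2, 1),
    ("a", "c", 1, 1),
    ("a", "b", -3, 1),
    ("b", "d", 2, 1),
]

# Shuffler: sorting makes identical (pivot, word, distance) keys adjacent
shuffled = sorted(mapper_output)

# Reducer: add up the counts within each run of identical keys
frequencies = {
    key: sum(record[3] for record in group)
    for key, group in groupby(shuffled, key=lambda record: record[:3])
}
for key, total in frequencies.items():
    print(key, total)
```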

### Step 1: Mapper

First, in the mapper we want to generate all skip-grams with distances from $-5$ to $5$.  
Remember that you already did something similar in the [Practice](#Practice) above? Just adapt it to the MapReduce format!  

Output: 
 - `"{pivot}\t{word}\t{distance}\t{count}"`  

Example: 
```
predict is      -3  1
predict used    -2  1
predict to      -1  1
predict the     1   1
...
```



In [54]:
import os
from string import punctuation

In [80]:
dis_range = 5
cnt = 1

with open(os.path.join('data', 'wiki1G.txt'), encoding="utf-8") as f:
    for line in f:
        # Lowercase, split on any whitespace, and strip surrounding punctuation;
        # drop tokens that become empty (e.g. a lone "-")
        words = (w.strip(punctuation) for w in line.lower().split())
        tokens = [w for w in words if w]
        token_length = len(tokens)
        for idx in range(token_length):
            # Left context: distances -dis_range .. -1
            for i in reversed(range(dis_range)):
                if idx-i-1 < 0:
                    continue
                tmp_str = tokens[idx] + "\t" + tokens[idx-i-1] + "\t" + str(-i-1) + "\t" + str(cnt) + "\n"
                # tmp_str ends with "\n" and print() adds another one, hence the
                # blank line after each record below; drop one of them in mapper.py
                print(tmp_str)
            # Right context: distances 1 .. dis_range
            for i in range(dis_range):
                if idx+i+1 >= token_length:
                    continue
                tmp_str = tokens[idx] + "\t" + tokens[idx+i+1] + "\t" + str(i+1) + "\t" + str(cnt) + "\n"
                print(tmp_str)

        break # for the sake of this practice, just test the first page now

anarchism	anarchism	1	1

anarchism	is	2	1

anarchism	a	3	1

anarchism	political	4	1

anarchism	philosophy	5	1

anarchism	anarchism	-1	1

anarchism	is	1	1

anarchism	a	2	1

anarchism	political	3	1

anarchism	philosophy	4	1

anarchism	and	5	1

is	anarchism	-2	1

is	anarchism	-1	1

is	a	1	1

is	political	2	1

is	philosophy	3	1

is	and	4	1

is	movement	5	1

a	anarchism	-3	1

a	anarchism	-2	1

a	is	-1	1

a	political	1	1

a	philosophy	2	1

a	and	3	1

a	movement	4	1

a	that	5	1

political	anarchism	-4	1

political	anarchism	-3	1

political	is	-2	1

political	a	-1	1

political	philosophy	1	1

political	and	2	1

political	movement	3	1

political	that	4	1

political	is	5	1

philosophy	anarchism	-5	1

philosophy	anarchism	-4	1

philosophy	is	-3	1

philosophy	a	-2	1

philosophy	political	-1	1

philosophy	and	1	1

philosophy	movement	2	1

philosophy	that	3	1

philosophy	is	4	1

philosophy	sceptical	5	1

and	anarchism	-5	1

and	is	-4	1

and	a	-3	1

and	political	-2	1

and	philosophy	-1	1

and	moveme


having	violent	3	1

having	turn	4	1

having	in	5	1

taken	down	-5	1

taken	authority	-4	1

taken	and	-3	1

taken	state	-2	1

taken	having	-1	1

taken	a	1	1

taken	violent	2	1

taken	turn	3	1

taken	in	4	1

taken	the	5	1

a	authority	-5	1

a	and	-4	1

a	state	-3	1

a	having	-2	1

a	taken	-1	1

a	violent	1	1

a	turn	2	1

a	in	3	1

a	the	4	1

a	past	5	1

violent	and	-5	1

violent	state	-4	1

violent	having	-3	1

violent	taken	-2	1

violent	a	-1	1

violent	turn	1	1

violent	in	2	1

violent	the	3	1

violent	past	4	1

violent	evolutionary	5	1

turn	state	-5	1

turn	having	-4	1

turn	taken	-3	1

turn	a	-2	1

turn	violent	-1	1

turn	in	1	1

turn	the	2	1

turn	past	3	1

turn	evolutionary	4	1

turn	tactics	5	1

in	having	-5	1

in	taken	-4	1

in	a	-3	1

in	violent	-2	1

in	turn	-1	1

in	the	1	1

in	past	2	1

in	evolutionary	3	1

in	tactics	4	1

in	aim	5	1

the	taken	-5	1

the	a	-4	1

the	violent	-3	1

the	turn	-2	1

the	in	-1	1

the	past	1	1

the	evolutionary	2	1

the	tactics	3	1

the	aim	4	1

t

been	synonymous	2	1

been	with	3	1

been	anarchism	4	1

been	its	5	1

largely	the	-5	1

largely	term	-4	1

largely	libertarian	-3	1

largely	has	-2	1

largely	been	-1	1

largely	synonymous	1	1

largely	with	2	1

largely	anarchism	3	1

largely	its	4	1

largely	meaning	5	1

synonymous	term	-5	1

synonymous	libertarian	-4	1

synonymous	has	-3	1

synonymous	been	-2	1

synonymous	largely	-1	1

synonymous	with	1	1

synonymous	anarchism	2	1

synonymous	its	3	1

synonymous	meaning	4	1

synonymous	has	5	1

with	libertarian	-5	1

with	has	-4	1

with	been	-3	1

with	largely	-2	1

with	synonymous	-1	1

with	anarchism	1	1

with	its	2	1

with	meaning	3	1

with	has	4	1

with	more	5	1

anarchism	has	-5	1

anarchism	been	-4	1

anarchism	largely	-3	1

anarchism	synonymous	-2	1

anarchism	with	-1	1

anarchism	its	1	1

anarchism	meaning	2	1

anarchism	has	3	1

anarchism	more	4	1

anarchism	recently	5	1

its	been	-5	1

its	largely	-4	1

its	synonymous	-3	1

its	with	-2	1

its	anarchism	-1	1

its	meaning	1	


to	in	2	1

to	or	3	1

to	progress	4	1

to	toward	5	1

exist	human	-5	1

exist	nature	-4	1

exist	allows	-3	1

exist	humans	-2	1

exist	to	-1	1

exist	in	1	1

exist	or	2	1

exist	progress	3	1

exist	toward	4	1

exist	such	5	1

in	nature	-5	1

in	allows	-4	1

in	humans	-3	1

in	to	-2	1

in	exist	-1	1

in	or	1	1

in	progress	2	1

in	toward	3	1

in	such	4	1

in	a	5	1

or	allows	-5	1

or	humans	-4	1

or	to	-3	1

or	exist	-2	1

or	in	-1	1

or	progress	1	1

or	toward	2	1

or	such	3	1

or	a	4	1

or	non-coercive	5	1

progress	humans	-5	1

progress	to	-4	1

progress	exist	-3	1

progress	in	-2	1

progress	or	-1	1

progress	toward	1	1

progress	such	2	1

progress	a	3	1

progress	non-coercive	4	1

progress	society	5	1

toward	to	-5	1

toward	exist	-4	1

toward	in	-3	1

toward	or	-2	1

toward	progress	-1	1

toward	such	1	1

toward	a	2	1

toward	non-coercive	3	1

toward	society	4	1

toward	and	5	1

such	exist	-5	1

such	in	-4	1

such	or	-3	1

such	progress	-2	1

such	toward	-1	1

such	a	1	1

such	no

taoist	philosophers	1	1

taoist	zhuang	2	1

taoist	zhou	3	1

taoist	and	4	1

taoist	laozi	5	1

philosophers	state	-5	1

philosophers	was	-4	1

philosophers	delineated	-3	1

philosophers	by	-2	1

philosophers	taoist	-1	1

philosophers	zhuang	1	1

philosophers	zhou	2	1

philosophers	and	3	1

philosophers	laozi	4	1

philosophers	alongside	5	1

zhuang	was	-5	1

zhuang	delineated	-4	1

zhuang	by	-3	1

zhuang	taoist	-2	1

zhuang	philosophers	-1	1

zhuang	zhou	1	1

zhuang	and	2	1

zhuang	laozi	3	1

zhuang	alongside	4	1

zhuang	stoicism	5	1

zhou	delineated	-5	1

zhou	by	-4	1

zhou	taoist	-3	1

zhou	philosophers	-2	1

zhou	zhuang	-1	1

zhou	and	1	1

zhou	laozi	2	1

zhou	alongside	3	1

zhou	stoicism	4	1

zhou	taoism	5	1

and	by	-5	1

and	taoist	-4	1

and	philosophers	-3	1

and	zhuang	-2	1

and	zhou	-1	1

and	laozi	1	1

and	alongside	2	1

and	stoicism	3	1

and	taoism	4	1

and	has	5	1

laozi	taoist	-5	1

laozi	philosophers	-4	1

laozi	zhuang	-3	1

laozi	zhou	-2	1

laozi	and	-1	1

laozi	alongside	


unprecedented	globalization	1	1

unprecedented	occurred	2	1

unprecedented	from	3	1

unprecedented	1880	4	1

unprecedented	to	5	1

globalization	a	-5	1

globalization	wave	-4	1

globalization	of	-3	1

globalization	then	-2	1

globalization	unprecedented	-1	1

globalization	occurred	1	1

globalization	from	2	1

globalization	1880	3	1

globalization	to	4	1

globalization	1914	5	1

occurred	wave	-5	1

occurred	of	-4	1

occurred	then	-3	1

occurred	unprecedented	-2	1

occurred	globalization	-1	1

occurred	from	1	1

occurred	1880	2	1

occurred	to	3	1

occurred	1914	4	1

occurred	this	5	1

from	of	-5	1

from	then	-4	1

from	unprecedented	-3	1

from	globalization	-2	1

from	occurred	-1	1

from	1880	1	1

from	to	2	1

from	1914	3	1

from	this	4	1

from	era	5	1

1880	then	-5	1

1880	unprecedented	-4	1

1880	globalization	-3	1

1880	occurred	-2	1

1880	from	-1	1

1880	to	1	1

1880	1914	2	1

1880	this	3	1

1880	era	4	1

1880	of	5	1

to	unprecedented	-5	1

to	globalization	-4	1

to	occurred	-3	1



the	of	-1	1

the	deed	1	1

the	the	2	1

the	dismemberment	3	1

the	of	4	1

the	the	5	1

deed	known	-5	1

deed	as	-4	1

deed	propaganda	-3	1

deed	of	-2	1

deed	the	-1	1

deed	the	1	1

deed	dismemberment	2	1

deed	of	3	1

deed	the	4	1

deed	french	5	1

the	as	-5	1

the	propaganda	-4	1

the	of	-3	1

the	the	-2	1

the	deed	-1	1

the	dismemberment	1	1

the	of	2	1

the	the	3	1

the	french	4	1

the	socialist	5	1

dismemberment	propaganda	-5	1

dismemberment	of	-4	1

dismemberment	the	-3	1

dismemberment	deed	-2	1

dismemberment	the	-1	1

dismemberment	of	1	1

dismemberment	the	2	1

dismemberment	french	3	1

dismemberment	socialist	4	1

dismemberment	movement	5	1

of	of	-5	1

of	the	-4	1

of	deed	-3	1

of	the	-2	1

of	dismemberment	-1	1

of	the	1	1

of	french	2	1

of	socialist	3	1

of	movement	4	1

of	into	5	1

the	the	-5	1

the	deed	-4	1

the	the	-3	1

the	dismemberment	-2	1

the	of	-1	1

the	french	1	1

the	socialist	2	1

the	movement	3	1

the	into	4	1

the	many	5	1

french	deed	-5	1

frenc


of	control	-1	1

of	barcelona	1	1

of	and	2	1

of	of	3	1

of	large	4	1

of	areas	5	1

barcelona	armed	-5	1

barcelona	militias	-4	1

barcelona	took	-3	1

barcelona	control	-2	1

barcelona	of	-1	1

barcelona	and	1	1

barcelona	of	2	1

barcelona	large	3	1

barcelona	areas	4	1

barcelona	of	5	1

and	militias	-5	1

and	took	-4	1

and	control	-3	1

and	of	-2	1

and	barcelona	-1	1

and	of	1	1

and	large	2	1

and	areas	3	1

and	of	4	1

and	rural	5	1

of	took	-5	1

of	control	-4	1

of	of	-3	1

of	barcelona	-2	1

of	and	-1	1

of	large	1	1

of	areas	2	1

of	of	3	1

of	rural	4	1

of	spain	5	1

large	control	-5	1

large	of	-4	1

large	barcelona	-3	1

large	and	-2	1

large	of	-1	1

large	areas	1	1

large	of	2	1

large	rural	3	1

large	spain	4	1

large	where	5	1

areas	of	-5	1

areas	barcelona	-4	1

areas	and	-3	1

areas	of	-2	1

areas	large	-1	1

areas	of	1	1

areas	rural	2	1

areas	spain	3	1

areas	where	4	1

areas	they	5	1

of	barcelona	-5	1

of	and	-4	1

of	of	-3	1

of	large	-2	1

of	areas	-1	1

was	this	-2	1

was	period	-1	1

was	the	1	1

was	confrontations	2	1

was	at	3	1

was	the	4	1

was	1999	5	1

the	event	-5	1

the	of	-4	1

the	this	-3	1

the	period	-2	1

the	was	-1	1

the	confrontations	1	1

the	at	2	1

the	the	3	1

the	1999	4	1

the	seattle	5	1

confrontations	of	-5	1

confrontations	this	-4	1

confrontations	period	-3	1

confrontations	was	-2	1

confrontations	the	-1	1

confrontations	at	1	1

confrontations	the	2	1

confrontations	1999	3	1

confrontations	seattle	4	1

confrontations	wto	5	1

at	this	-5	1

at	period	-4	1

at	was	-3	1

at	the	-2	1

at	confrontations	-1	1

at	the	1	1

at	1999	2	1

at	seattle	3	1

at	wto	4	1

at	conference	5	1

the	period	-5	1

the	was	-4	1

the	the	-3	1

the	confrontations	-2	1

the	at	-1	1

the	1999	1	1

the	seattle	2	1

the	wto	3	1

the	conference	4	1

the	anarchist	5	1

1999	was	-5	1

1999	the	-4	1

1999	confrontations	-3	1

1999	at	-2	1

1999	the	-1	1

1999	seattle	1	1

1999	wto	2	1

1999	conference	3	1

1999	anarchist	4	1

1999	idea


exist	and	-2	1

exist	traditions	-1	1

exist	and	1	1

exist	varieties	2	1

exist	of	3	1

exist	anarchy	4	1

exist	diverge	5	1

and	anarchist	-5	1

and	types	-4	1

and	and	-3	1

and	traditions	-2	1

and	exist	-1	1

and	varieties	1	1

and	of	2	1

and	anarchy	3	1

and	diverge	4	1

and	widely	5	1

varieties	types	-5	1

varieties	and	-4	1

varieties	traditions	-3	1

varieties	exist	-2	1

varieties	and	-1	1

varieties	of	1	1

varieties	anarchy	2	1

varieties	diverge	3	1

varieties	widely	4	1

varieties	one	5	1

of	and	-5	1

of	traditions	-4	1

of	exist	-3	1

of	and	-2	1

of	varieties	-1	1

of	anarchy	1	1

of	diverge	2	1

of	widely	3	1

of	one	4	1

of	reaction	5	1

anarchy	traditions	-5	1

anarchy	exist	-4	1

anarchy	and	-3	1

anarchy	varieties	-2	1

anarchy	of	-1	1

anarchy	diverge	1	1

anarchy	widely	2	1

anarchy	one	3	1

anarchy	reaction	4	1

anarchy	against	5	1

diverge	exist	-5	1

diverge	and	-4	1

diverge	varieties	-3	1

diverge	of	-2	1

diverge	anarchy	-1	1

diverge	widely	1	1

diverg


worker	with	2	1

worker	production	3	1

worker	and	4	1

worker	consumption	5	1

cooperatives	associations	-5	1

cooperatives	workers	-4	1

cooperatives	councils	-3	1

cooperatives	and	-2	1

cooperatives	worker	-1	1

cooperatives	with	1	1

cooperatives	production	2	1

cooperatives	and	3	1

cooperatives	consumption	4	1

cooperatives	based	5	1

with	workers	-5	1

with	councils	-4	1

with	and	-3	1

with	worker	-2	1

with	cooperatives	-1	1

with	production	1	1

with	and	2	1

with	consumption	3	1

with	based	4	1

with	on	5	1

production	councils	-5	1

production	and	-4	1

production	worker	-3	1

production	cooperatives	-2	1

production	with	-1	1

production	and	1	1

production	consumption	2	1

production	based	3	1

production	on	4	1

production	the	5	1

and	and	-5	1

and	worker	-4	1

and	cooperatives	-3	1

and	with	-2	1

and	production	-1	1

and	consumption	1	1

and	based	2	1

and	on	3	1

and	the	4	1

and	guiding	5	1

consumption	worker	-5	1

consumption	cooperatives	-4	1

consumption	with	


anarchist	academic	-3	1

anarchist	theory	-2	1

anarchist	various	-1	1

anarchist	groups	1	1

anarchist	tendencies	2	1

anarchist	and	3	1

anarchist	schools	4	1

anarchist	of	5	1

groups	over	-5	1

groups	academic	-4	1

groups	theory	-3	1

groups	various	-2	1

groups	anarchist	-1	1

groups	tendencies	1	1

groups	and	2	1

groups	schools	3	1

groups	of	4	1

groups	thought	5	1

tendencies	academic	-5	1

tendencies	theory	-4	1

tendencies	various	-3	1

tendencies	anarchist	-2	1

tendencies	groups	-1	1

tendencies	and	1	1

tendencies	schools	2	1

tendencies	of	3	1

tendencies	thought	4	1

tendencies	exist	5	1

and	theory	-5	1

and	various	-4	1

and	anarchist	-3	1

and	groups	-2	1

and	tendencies	-1	1

and	schools	1	1

and	of	2	1

and	thought	3	1

and	exist	4	1

and	today	5	1

schools	various	-5	1

schools	anarchist	-4	1

schools	groups	-3	1

schools	tendencies	-2	1

schools	and	-1	1

schools	of	1	1

schools	thought	2	1

schools	exist	3	1

schools	today	4	1

schools	making	5	1

of	anarchist

also	of	5	1

employed	but	-5	1

employed	some	-4	1

employed	of	-3	1

employed	them	-2	1

employed	also	-1	1

employed	terrorism	1	1

employed	as	2	1

employed	propaganda	3	1

employed	of	4	1

employed	the	5	1

terrorism	some	-5	1

terrorism	of	-4	1

terrorism	them	-3	1

terrorism	also	-2	1

terrorism	employed	-1	1

terrorism	as	1	1

terrorism	propaganda	2	1

terrorism	of	3	1

terrorism	the	4	1

terrorism	deed	5	1

as	of	-5	1

as	them	-4	1

as	also	-3	1

as	employed	-2	1

as	terrorism	-1	1

as	propaganda	1	1

as	of	2	1

as	the	3	1

as	deed	4	1

as	assassination	5	1

propaganda	them	-5	1

propaganda	also	-4	1

propaganda	employed	-3	1

propaganda	terrorism	-2	1

propaganda	as	-1	1

propaganda	of	1	1

propaganda	the	2	1

propaganda	deed	3	1

propaganda	assassination	4	1

propaganda	attempts	5	1

of	also	-5	1

of	employed	-4	1

of	terrorism	-3	1

of	as	-2	1

of	propaganda	-1	1

of	the	1	1

of	deed	2	1

of	assassination	3	1

of	attempts	4	1

of	were	5	1

the	employed	-5	1

the	terrorism	-4


both	other	-4	1

both	prominent	-3	1

both	anarchists	-2	1

both	afterwards	-1	1

both	bonanno	1	1

both	and	2	1

both	the	3	1

both	french	4	1

both	group	5	1

bonanno	other	-5	1

bonanno	prominent	-4	1

bonanno	anarchists	-3	1

bonanno	afterwards	-2	1

bonanno	both	-1	1

bonanno	and	1	1

bonanno	the	2	1

bonanno	french	3	1

bonanno	group	4	1

bonanno	the	5	1

and	prominent	-5	1

and	anarchists	-4	1

and	afterwards	-3	1

and	both	-2	1

and	bonanno	-1	1

and	the	1	1

and	french	2	1

and	group	3	1

and	the	4	1

and	invisible	5	1

the	anarchists	-5	1

the	afterwards	-4	1

the	both	-3	1

the	bonanno	-2	1

the	and	-1	1

the	french	1	1

the	group	2	1

the	the	3	1

the	invisible	4	1

the	committee	5	1

french	afterwards	-5	1

french	both	-4	1

french	bonanno	-3	1

french	and	-2	1

french	the	-1	1

french	group	1	1

french	the	2	1

french	invisible	3	1

french	committee	4	1

french	advocate	5	1

group	both	-5	1

group	bonanno	-4	1

group	and	-3	1

group	the	-2	1

group	french	-1	1

group	the

the	to	-2	1

the	play	-1	1

the	role	1	1

the	of	2	1

the	facilitator	3	1

the	to	4	1

the	help	5	1

role	the	-5	1

role	group	-4	1

role	to	-3	1

role	play	-2	1

role	the	-1	1

role	of	1	1

role	facilitator	2	1

role	to	3	1

role	help	4	1

role	achieve	5	1

of	group	-5	1

of	to	-4	1

of	play	-3	1

of	the	-2	1

of	role	-1	1

of	facilitator	1	1

of	to	2	1

of	help	3	1

of	achieve	4	1

of	a	5	1

facilitator	to	-5	1

facilitator	play	-4	1

facilitator	the	-3	1

facilitator	role	-2	1

facilitator	of	-1	1

facilitator	to	1	1

facilitator	help	2	1

facilitator	achieve	3	1

facilitator	a	4	1

facilitator	consensus	5	1

to	play	-5	1

to	the	-4	1

to	role	-3	1

to	of	-2	1

to	facilitator	-1	1

to	help	1	1

to	achieve	2	1

to	a	3	1

to	consensus	4	1

to	without	5	1

help	the	-5	1

help	role	-4	1

help	of	-3	1

help	facilitator	-2	1

help	to	-1	1

help	achieve	1	1

help	a	2	1

help	consensus	3	1

help	without	4	1

help	taking	5	1

achieve	role	-5	1

achieve	of	-4	1

achieve	facilitator	-3	1

achi


common	values	-5	1

common	ideology	-4	1

common	and	-3	1

common	tactics	-2	1

common	is	-1	1

common	its	1	1

common	diversity	2	1

common	has	3	1

common	led	4	1

common	to	5	1

its	ideology	-5	1

its	and	-4	1

its	tactics	-3	1

its	is	-2	1

its	common	-1	1

its	diversity	1	1

its	has	2	1

its	led	3	1

its	to	4	1

its	widely	5	1

diversity	and	-5	1

diversity	tactics	-4	1

diversity	is	-3	1

diversity	common	-2	1

diversity	its	-1	1

diversity	has	1	1

diversity	led	2	1

diversity	to	3	1

diversity	widely	4	1

diversity	different	5	1

has	tactics	-5	1

has	is	-4	1

has	common	-3	1

has	its	-2	1

has	diversity	-1	1

has	led	1	1

has	to	2	1

has	widely	3	1

has	different	4	1

has	uses	5	1

led	is	-5	1

led	common	-4	1

led	its	-3	1

led	diversity	-2	1

led	has	-1	1

led	to	1	1

led	widely	2	1

led	different	3	1

led	uses	4	1

led	of	5	1

to	common	-5	1

to	its	-4	1

to	diversity	-3	1

to	has	-2	1

to	led	-1	1

to	widely	1	1

to	different	2	1

to	uses	3	1

to	of	4	1

to	identical	5	1


the	free	5	1

jealousy	some	-5	1

jealousy	anarchists	-4	1

jealousy	struggled	-3	1

jealousy	with	-2	1

jealousy	the	-1	1

jealousy	that	1	1

jealousy	arose	2	1

jealousy	from	3	1

jealousy	free	4	1

jealousy	love	5	1

that	anarchists	-5	1

that	struggled	-4	1

that	with	-3	1

that	the	-2	1

that	jealousy	-1	1

that	arose	1	1

that	from	2	1

that	free	3	1

that	love	4	1

that	anarchist	5	1

arose	struggled	-5	1

arose	with	-4	1

arose	the	-3	1

arose	jealousy	-2	1

arose	that	-1	1

arose	from	1	1

arose	free	2	1

arose	love	3	1

arose	anarchist	4	1

arose	feminists	5	1

from	with	-5	1

from	the	-4	1

from	jealousy	-3	1

from	that	-2	1

from	arose	-1	1

from	free	1	1

from	love	2	1

from	anarchist	3	1

from	feminists	4	1

from	were	5	1

free	the	-5	1

free	jealousy	-4	1

free	that	-3	1

free	arose	-2	1

free	from	-1	1

free	love	1	1

free	anarchist	2	1

free	feminists	3	1

free	were	4	1

free	advocates	5	1

love	jealousy	-5	1

love	that	-4	1

love	arose	-3	1

love	from	-2	1

love	free	


later	state	-4	1

later	and	-3	1

later	ferrer	-2	1

later	was	-1	1

later	arrested	1	1

later	nonetheless	2	1

later	his	3	1

later	ideas	4	1

later	formed	5	1

arrested	state	-5	1

arrested	and	-4	1

arrested	ferrer	-3	1

arrested	was	-2	1

arrested	later	-1	1

arrested	nonetheless	1	1

arrested	his	2	1

arrested	ideas	3	1

arrested	formed	4	1

arrested	the	5	1

nonetheless	and	-5	1

nonetheless	ferrer	-4	1

nonetheless	was	-3	1

nonetheless	later	-2	1

nonetheless	arrested	-1	1

nonetheless	his	1	1

nonetheless	ideas	2	1

nonetheless	formed	3	1

nonetheless	the	4	1

nonetheless	inspiration	5	1

his	ferrer	-5	1

his	was	-4	1

his	later	-3	1

his	arrested	-2	1

his	nonetheless	-1	1

his	ideas	1	1

his	formed	2	1

his	the	3	1

his	inspiration	4	1

his	for	5	1

ideas	was	-5	1

ideas	later	-4	1

ideas	arrested	-3	1

ideas	nonetheless	-2	1

ideas	his	-1	1

ideas	formed	1	1

ideas	the	2	1

ideas	inspiration	3	1

ideas	for	4	1

ideas	a	5	1

formed	later	-5	1

formed	arrested	-4	1

formed	n

instead	people	2	1

instead	being	3	1

instead	able	4	1

instead	to	5	1

of	of	-5	1

of	its	-4	1

of	political	-3	1

of	tendencies	-2	1

of	instead	-1	1

of	people	1	1

of	being	2	1

of	able	3	1

of	to	4	1

of	control	5	1

people	its	-5	1

people	political	-4	1

people	tendencies	-3	1

people	instead	-2	1

people	of	-1	1

people	being	1	1

people	able	2	1

people	to	3	1

people	control	4	1

people	the	5	1

being	political	-5	1

being	tendencies	-4	1

being	instead	-3	1

being	of	-2	1

being	people	-1	1

being	able	1	1

being	to	2	1

being	control	3	1

being	the	4	1

being	aspects	5	1

able	tendencies	-5	1

able	instead	-4	1

able	of	-3	1

able	people	-2	1

able	being	-1	1

able	to	1	1

able	control	2	1

able	the	3	1

able	aspects	4	1

able	of	5	1

to	instead	-5	1

to	of	-4	1

to	people	-3	1

to	being	-2	1

to	able	-1	1

to	control	1	1

to	the	2	1

to	aspects	3	1

to	of	4	1

to	their	5	1

control	of	-5	1

control	people	-4	1

control	being	-3	1

control	able	-2	1

control	to	-1	1

contr


for	used	2	1

for	art	3	1

for	as	4	1

for	a	5	1

or	life	-5	1

or	other	-4	1

or	anarchists	-3	1

or	advocated	-2	1

or	for	-1	1

or	used	1	1

or	art	2	1

or	as	3	1

or	a	4	1

or	means	5	1

used	other	-5	1

used	anarchists	-4	1

used	advocated	-3	1

used	for	-2	1

used	or	-1	1

used	art	1	1

used	as	2	1

used	a	3	1

used	means	4	1

used	to	5	1

art	anarchists	-5	1

art	advocated	-4	1

art	for	-3	1

art	or	-2	1

art	used	-1	1

art	as	1	1

art	a	2	1

art	means	3	1

art	to	4	1

art	achieve	5	1

as	advocated	-5	1

as	for	-4	1

as	or	-3	1

as	used	-2	1

as	art	-1	1

as	a	1	1

as	means	2	1

as	to	3	1

as	achieve	4	1

as	anarchist	5	1

a	for	-5	1

a	or	-4	1

a	used	-3	1

a	art	-2	1

a	as	-1	1

a	means	1	1

a	to	2	1

a	achieve	3	1

a	anarchist	4	1

a	ends	5	1

means	or	-5	1

means	used	-4	1

means	art	-3	1

means	as	-2	1

means	a	-1	1

means	to	1	1

means	achieve	2	1

means	anarchist	3	1

means	ends	4	1

means	in	5	1

to	used	-5	1

to	art	-4	1

to	as	-3	1

to	a	-2	1

to	means	-1	1

to	achiev

colin	anarchist	5	1

ward	entities	-5	1

ward	can	-4	1

ward	be	-3	1

ward	self-governing	-2	1

ward	colin	-1	1

ward	responds	1	1

ward	that	2	1

ward	major	3	1

ward	anarchist	4	1

ward	thinkers	5	1

responds	can	-5	1

responds	be	-4	1

responds	self-governing	-3	1

responds	colin	-2	1

responds	ward	-1	1

responds	that	1	1

responds	major	2	1

responds	anarchist	3	1

responds	thinkers	4	1

responds	advocated	5	1

that	be	-5	1

that	self-governing	-4	1

that	colin	-3	1

that	ward	-2	1

that	responds	-1	1

that	major	1	1

that	anarchist	2	1

that	thinkers	3	1

that	advocated	4	1

that	federalism	5	1

major	self-governing	-5	1

major	colin	-4	1

major	ward	-3	1

major	responds	-2	1

major	that	-1	1

major	anarchist	1	1

major	thinkers	2	1

major	advocated	3	1

major	federalism	4	1

major	philosophy	5	1

anarchist	colin	-5	1

anarchist	ward	-4	1

anarchist	responds	-3	1

anarchist	that	-2	1

anarchist	major	-1	1

anarchist	thinkers	1	1

anarchist	advocated	2	1

anarchist	federalism	3	1



ferguson	frances	-2	1

ferguson	l	-1	1

ferguson	joseph	1	1

ferguson	raz	2	1

ferguson	argues	3	1

ferguson	that	4	1

ferguson	the	5	1

joseph	review	-5	1

joseph	by	-4	1

joseph	frances	-3	1

joseph	l	-2	1

joseph	ferguson	-1	1

joseph	raz	1	1

joseph	argues	2	1

joseph	that	3	1

joseph	the	4	1

joseph	acceptance	5	1

raz	by	-5	1

raz	frances	-4	1

raz	l	-3	1

raz	ferguson	-2	1

raz	joseph	-1	1

raz	argues	1	1

raz	that	2	1

raz	the	3	1

raz	acceptance	4	1

raz	of	5	1

argues	frances	-5	1

argues	l	-4	1

argues	ferguson	-3	1

argues	joseph	-2	1

argues	raz	-1	1

argues	that	1	1

argues	the	2	1

argues	acceptance	3	1

argues	of	4	1

argues	authority	5	1

that	l	-5	1

that	ferguson	-4	1

that	joseph	-3	1

that	raz	-2	1

that	argues	-1	1

that	the	1	1

that	acceptance	2	1

that	of	3	1

that	authority	4	1

that	implies	5	1

the	ferguson	-5	1

the	joseph	-4	1

the	raz	-3	1

the	argues	-2	1

the	that	-1	1

the	acceptance	1	1

the	of	2	1

the	authority	3	1

the	implies	4	1

the	the	5	1

ac

### Step 2: Shuffler

All the shuffler has to do is sort the mapper output, so let's use the built-in `sort` command to do it for us!  

Try this on your terminal/command prompt ;)  
(You can get the sample input from [here](https://drive.google.com/drive/folders/1vKxr--sLd2J4kdsXUzJDBZdG3AmV4NGl?usp=sharing))

**Unix**  
```bash
sort -k1,3 < mapper.sample.tsv
```
**Windows**
```powershell
type mapper.sample.tsv | sort
```
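
To see what the shuffler is expected to produce, here is a minimal in-memory sketch in Python. It assumes the mapper output is small enough to fit in memory, which holds for the sample file but not for the full dataset, and the helper name `shuffle` is made up for illustration. It sorts by pivot, then word, then distance compared as a number, so identical skipgrams end up on adjacent lines.

```python
def shuffle(lines):
    """Sort mapper records so that identical (pivot, word) pairs are adjacent.

    `lines` is an iterable of "pivot\tword\tdistance\tcount" strings.
    Hypothetical helper for illustration only; the assignment uses `sort`.
    """
    records = [line.rstrip("\n").split("\t") for line in lines if line.strip()]
    # Sort by pivot, then word, then distance compared as an integer.
    records.sort(key=lambda r: (r[0], r[1], int(r[2])))
    return ["\t".join(r) for r in records]


sample = [
    "ward\tthat\t2\t1",
    "that\tward\t-2\t1",
    "that\tmajor\t1\t1",
    "that\tward\t-2\t1",
]
for record in shuffle(sample):
    print(record)
```

The exact order may differ from what `sort` gives on your machine (it depends on the locale), but that is fine: the reducer only needs records of the same skipgram to sit next to each other.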

### Step 3: Reducer

Since the input has already been sorted by the shuffler, the reducer's task is pretty simple: just count how many times the same skipgram appears, and then print the count out!

Input: 
 - `"{pivot}\t{word}\t{distance}\t{count}"`
 - You can get a sample input file `shuffler.sample.tsv` from [here](https://drive.google.com/drive/folders/1vKxr--sLd2J4kdsXUzJDBZdG3AmV4NGl?usp=sharing)

Output: 
 - `"{pivot}\t{word}\t{total_freq}\t{-5}\t{-4}\t{-3}\t{-2}\t{-1}\t{1}\t{2}\t{3}\t{4}\t{5}"`
 - The first two columns are the skipgram; the third column is the total frequency; columns 4\~13 are the frequencies at distances -5\~5, skipping 0.

Example:
 - `arouse  open    4       0       0       3       0       0       0       0       0       0       1`
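
To make the column layout concrete, the sketch below decodes the example row back into a distance-to-frequency mapping. The names `DISTANCES` and `parse_reducer_row` are hypothetical, and splitting on any whitespace is an assumption that only works when the words themselves contain no spaces.

```python
DISTANCES = [-5, -4, -3, -2, -1, 1, 2, 3, 4, 5]  # column order; 0 is skipped


def parse_reducer_row(row):
    """Split one reducer output row into (pivot, word, total, {distance: freq})."""
    fields = row.split()  # assumption: whitespace-separated, no spaces inside words
    pivot, word, total = fields[0], fields[1], int(fields[2])
    freqs = dict(zip(DISTANCES, map(int, fields[3:])))
    return pivot, word, total, freqs


pivot, word, total, freqs = parse_reducer_row(
    "arouse  open    4       0       0       3       0       0       0       0       0       0       1"
)
print(pivot, word, total)
# Keep only the distances that actually occurred.
print({d: f for d, f in freqs.items() if f})
```

Read this way, the example says that *open* appears 3 times three words before *arouse* and once five words after it, for a total of 4.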

Hints: 
1. Parse the input from shuffler
2. Check if this is the same skipgram as the previous one
3. If so, add the frequency according to its distance
4. If not, output the previous skipgram data

Note that you should NOT store all your counting results in a dict or any other data structure.  
Recall that one purpose of MapReduce is to avoid running out of memory. It loses its value if you end up holding the whole result in memory anyway.  
Instead, <u>directly print it out or write it into a file</u> .  
(Don't get me wrong: of course you can store some temporary data, but let's not store the whole result and then print it out at once, okay?)
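
One way to follow the hints and the memory note above is sketched here with `itertools.groupby`, which walks through sorted records and only ever holds one skipgram's counts at a time. This is an illustrative sketch, not the required solution: `reduce_sorted` and `DISTANCES` are made-up names, and it assumes every record has exactly four tab-separated fields with a distance between -5 and 5 (excluding 0).

```python
from itertools import groupby

DISTANCES = [-5, -4, -3, -2, -1, 1, 2, 3, 4, 5]  # output column order


def reduce_sorted(lines):
    """Yield one output row per skipgram from already-sorted mapper records."""
    records = (line.rstrip("\n").split("\t") for line in lines if line.strip())
    # groupby only merges *adjacent* equal keys, which is why sorting matters.
    for (pivot, word), group in groupby(records, key=lambda r: (r[0], r[1])):
        freqs = dict.fromkeys(DISTANCES, 0)
        for _, _, distance, count in group:
            freqs[int(distance)] += int(count)
        total = sum(freqs.values())
        yield "\t".join([pivot, word, str(total)] + [str(freqs[d]) for d in DISTANCES])


sorted_input = [
    "arouse\topen\t-3\t1",
    "arouse\topen\t-3\t2",
    "arouse\topen\t5\t1",
    "arouse\tso\t-4\t1",
    "arouse\tso\t4\t1",
]
for row in reduce_sorted(sorted_input):
    print(row)
```

Because `reduce_sorted` is a generator, each row can be printed or written as soon as its skipgram is finished, so nothing but the current counts stays in memory.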


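In [81]:
import os

dis_range = 5            # largest distance on either side of the pivot
shlen = 2 * dis_range    # one column per distance in -5..-1 and 1..5 (no 0)

def format_row(pivot, word, total, counts):
    # pivot, word, total frequency, then one column per distance
    return "\t".join([pivot, word, str(total)] + [str(c) for c in counts])

with open(os.path.join('data', 'shuffler.sample.tsv')) as f, \
        open("reduce_test_ipynb.tsv", "w") as fw:
    tmp_test = 0                    # number of input lines processed so far
    pivot_word = skip_word = None   # skipgram currently being accumulated
    pivot_cnt = 0                   # total frequency of the current skipgram
    pivot_list = [0] * shlen        # frequency per distance

    for line in f:
        # 1) Parse the input from shuffler
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue                # skip blank or malformed lines
        distance, count = int(fields[2]), int(fields[3])

        # 2) Check if this is the same skipgram
        if (fields[0], fields[1]) != (pivot_word, skip_word):
            # 4) If not, output the previous skipgram data (if there is one)
            if pivot_word is not None:
                row = format_row(pivot_word, skip_word, pivot_cnt, pivot_list)
                if tmp_test < 30:   # only show the first few rows in the notebook
                    print(row)
                fw.write(row + "\n")
            pivot_word, skip_word = fields[0], fields[1]
            pivot_cnt = 0
            pivot_list = [0] * shlen

        # 3) Add the frequency according to its distance
        pivot_cnt += count
        # Negative distances fill columns 0..4, positive ones columns 5..9
        index = distance + dis_range if distance < 0 else distance + dis_range - 1
        pivot_list[index] += count
        tmp_test += 1

    # Don't forget the last skipgram: no later line triggers its output
    if pivot_word is not None:
        fw.write(format_row(pivot_word, skip_word, pivot_cnt, pivot_list) + "\n")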

1539	a	1	0	0	0	0	0	0	0	0	0	1	
1539	anarchisme	1	0	1	0	0	0	0	0	0	0	0	
1539	anarchy	1	0	0	0	1	0	0	0	0	0	0	
1539	and	1	0	0	1	0	0	0	0	0	0	0	
1539	as	1	1	0	0	0	0	0	0	0	0	0	
1539	early	1	0	0	0	0	0	1	0	0	0	0	
1539	empahised	1	0	0	0	0	0	0	0	0	1	0	
1539	english	1	0	0	0	0	0	0	1	0	0	0	
1539	from	1	0	0	0	0	1	0	0	0	0	0	
1539	usages	1	0	0	0	0	0	0	0	1	0	0	
1642	anarchism	1	1	0	0	0	0	0	0	0	0	0	
1642	anarchisme	1	0	0	0	0	0	0	1	0	0	0	
1642	anarchy	1	0	0	0	0	0	0	0	0	1	0	
1642	and	1	0	0	0	0	0	0	0	1	0	0	
1642	appears	1	0	1	0	0	0	0	0	0	0	0	
1642	as	1	0	0	0	0	0	1	0	0	0	0	
1642	english	1	0	0	0	1	0	0	0	0	0	0	
1642	from	2	0	0	0	0	1	0	0	0	0	1	
1642	in	1	0	0	1	0	0	0	0	0	0	0	
1756	1808	1	0	0	0	0	0	0	0	0	0	1	
1756	1836	1	0	0	0	0	0	1	0	0	0	0	
1756	and	1	0	0	0	0	0	0	1	0	0	0	
1756	as	1	0	0	1	0	0	0	0	0	0	0	
1756	century	1	1	0	0	0	0	0	0	0	0	0	
1756	godwin	1	0	0	0	0	1	0	0	0	0	0	
1756	such	1	0	1	0	0	0	0	0	0	0	0	
1756	weitling	1	0	0	0	0	0	0	0	0	1	0	
1756	wilhelm	1	0	0	0	0	0	0	0	1	0	0	


### Step 4: Combine them together!  

Now you can move your code above into mapper.py and reducer.py (with some tiny modifications, of course), and this is your assignment this week!   
See below for the detailed requirements.  

**Hints: What should I modify in my mapper and reducer?**  

1. Read from standard input and write to standard output, rather than to files (We've already done this for you)
2. Process the whole dataset, rather than only the first line

That's it!  
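
As a rough picture of the first hint, here is a minimal sketch of a mapper that works on streams instead of files. It is only one possible shape: `skipgrams` and `run_mapper` are hypothetical names, and lowercasing plus whitespace tokenization is an assumption, so use whatever tokenization you chose in your own mapper. The demo feeds it an in-memory stream so that it can be run as-is.

```python
import io


def skipgrams(line, max_distance=5):
    """Yield (pivot, word, distance) for every word pair within max_distance."""
    tokens = line.lower().split()  # assumption: lowercase, whitespace tokenization
    for i, pivot in enumerate(tokens):
        start = max(0, i - max_distance)
        end = min(len(tokens), i + max_distance + 1)
        for j in range(start, end):
            if j != i:  # distance 0 is the pivot itself, so skip it
                yield pivot, tokens[j], j - i


def run_mapper(stream_in, stream_out):
    """Read text lines from stream_in and write one record per skipgram."""
    for line in stream_in:  # one line at a time, so memory use stays small
        for pivot, word, distance in skipgrams(line):
            stream_out.write(f"{pivot}\t{word}\t{distance}\t1\n")


# Demo with in-memory streams.
# In mapper.py you would call run_mapper(sys.stdin, sys.stdout) instead.
demo_in = io.StringIO("Strong winds blew roofs away\n")
demo_out = io.StringIO()
run_mapper(demo_in, demo_out)
print(demo_out.getvalue(), end="")
```

Passing the streams in as arguments keeps the logic identical whether the data comes from a file, a pipe, or a test string.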

The processing takes some time (~1hr w/o parallel computing), so go enjoy some coffee or movies (or sleep) while you wait ;)

<a name="Assignment-Requirement"></a>
## Assignment Requirement 

1. You need to implement the `mapper.py` and `reducer.py` to calculate the skip-gram table.

2. In `mapper.py`, you need to generate skipgrams with distance within -5 to 5 (inclusive).  
   - Input: Plain text file (`wiki1G.txt`) with one wiki page per line.
   - Output: `"{pivot}\t{word}\t{distance}\t{count}"`
   - Example: 
     ```
     predict is  -3  1
     predict used    -2  1
     predict to  -1  1
     predict the 1   1
     ...
     ```
   - Sample output: `mapper.sample.tsv` (Find it [here](https://drive.google.com/drive/folders/1vKxr--sLd2J4kdsXUzJDBZdG3AmV4NGl?usp=sharing); no need to be exactly the same)

3. In `reducer.py`, you have to collect the output from the shuffler (`sort`) and generate the skip-gram table.
   - Input: `"{pivot}\t{word}\t{distance}\t{count}"`
   - Output: 
     - `"{pivot}\t{word}\t{total}\t{-5}\t{-4}\t{-3}\t{-2}\t{-1}\t{1}\t{2}\t{3}\t{4}\t{5}"`
     - The first two columns are the skipgram; the third column is the total frequency; columns 4\~13 are the frequencies at distances -5\~5, skipping 0.
   - Example:
     ```
     arouse  of      1       0       0       0       0       0       0       0       1       0       0
     arouse  open    4       0       0       3       0       0       0       0       0       0       1
     arouse  so      2       0       1       0       0       0       0       0       0       1       0
     arouse  sufficiently    1       0       1       0       0       0       0       0       0       0       0
     ...
     ```
   - Sample output: `reducer.sample.tsv` (Find it [here](https://drive.google.com/drive/folders/1vKxr--sLd2J4kdsXUzJDBZdG3AmV4NGl?usp=sharing); no need to be exactly the same)

4. Chain your mapper, shuffler, and reducer together and generate the skip-gram table on the wiki1G dataset
   - Unix: 
     - Use the [local map-reduce tool](https://github.com/dspp779/local-mapreduce) (faster),
     - or run it directly: `python mapper.py < wiki1G.txt | sort -k1,2 -k3n | python reducer.py > skipgram.tsv` (slower)
   - Windows: 
     - CMD: `python mapper.py < wiki1G.txt | sort | python reducer.py > skipgram.tsv`
     - PS: `type wiki1G.txt | python mapper.py | sort | python reducer.py > skipgram.tsv`
     - or the bash environment you installed last week.  
   - See [Appendix](#built-in-command) if you want to know what these commands mean

During the demo, you need to 

1. show us your skip-gram result on the given dataset, and
2. explain your implementation in `mapper.py` and `reducer.py`.  

Note that the final result will be a large file (~6 GB), so **you may want to show it with the `more` or `less` command**.  

## TA's note

Congratulations! You've learned how to calculate skipgram frequencies and how to deal with a huge dataset using the MapReduce technique.  

Remember to <b><a href="https://docs.google.com/spreadsheets/d/1QGeYl5dsD9sFO9SYg4DIKk-xr-yGjRDOOLKZqCLDv2E/edit?usp=sharing">make an appointment with the TA</a> to demo/explain your implementation <u>before <font color="red">10/21 15:30</font></u></b> .  
You should also submit your `mapper.py` and `reducer.py` to <a href="https://eeclass.nthu.edu.tw/course/homework/3285">eeclass</a> .

<a name="built-in-command"></a>
## Appendix: useful built-in commands

Several built-in commands are very useful in the MapReduce procedure.  
Here we introduce `cat` and `type`, `<` and `>`, `sort`, and pipe `|`.  

### cat (on Unix)
The `cat` command, which definitely does not refer to some cute creatures (*meow~*), is short for `concatenate`. ([doc](https://man7.org/linux/man-pages/man1/cat.1.html))   

When you `cat` a file, it means you want to print the content from a file (or some files) to standard output.  
Now open your bash and test the command below!  
```bash
cat file.txt
```

You should see something like this: 

![picture](https://i.imgur.com/Z9shOYQ.png)

### type (on Windows)
`type` command works exactly the same as `cat` on Unix, but without its cute nickname (Shame on you, Windows). ([doc](https://docs.microsoft.com/en-us/windows-server/administration/windows-commands/type))  

Similarly, if you `type` a file, it means you print the content from a file (or some files) to standard output.  

```powershell
type file.txt
```

You should see something like this:  
![](https://i.imgur.com/5WFhxkq.png)

### `>`? `<`? `>///<`? 

`<` and `>` are the I/O redirections.  
`program < filename` means that you want to redirect the input from a file to a program, while `program > filename` means that you want to redirect the output of a program to that file.  

For example,
```bash
echo "hello world" > greet.txt
```
writes the string "hello world" into a file `greet.txt`.  

On the other hand, 
```bash
head < greet.txt
```
makes `head` receive the content from `greet.txt`, so it will print out the string in `greet.txt`.  
![](https://i.imgur.com/swxv8LG.png)
<small>p.s. `>///<` is just a joke. Don't take it seriously.</small>

### sort

As its name suggests, `sort` sorts the data that it receives. (doc on [Linux](https://man7.org/linux/man-pages/man1/sort.1.html) and on [Windows](https://docs.microsoft.com/en-us/windows-server/administration/windows-commands/sort))  
Try this:
```
sort sample.txt
```
You can see that the content has been sorted before being printed onto your screen.  
![](https://i.imgur.com/QFEq3Tc.png)

### Pipe `|`

A pipe passes the output of the previous program to the next program as its input.  
For example, 
```bash
python program.py | sort
```
will pass the output of `program.py` to the `sort` command.  
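
If you are curious what the shell does behind a pipe, the sketch below wires two processes together by hand with Python's `subprocess` module. Both "programs" are tiny stand-ins invented for this example and run with the current Python interpreter, so the sketch does not rely on any external command being installed.

```python
import subprocess
import sys

# Stand-in for `python program.py`: prints three unsorted lines.
producer = [sys.executable, "-c", "print('b'); print('c'); print('a')"]
# Stand-in for `sort`: reads stdin, writes the lines back in sorted order.
sorter = [sys.executable, "-c",
          "import sys; sys.stdout.writelines(sorted(sys.stdin))"]

# Connect the producer's stdout to the sorter's stdin, like `producer | sorter`.
first = subprocess.Popen(producer, stdout=subprocess.PIPE)
second = subprocess.Popen(sorter, stdin=first.stdout,
                          stdout=subprocess.PIPE, text=True)
first.stdout.close()  # the parent no longer needs its copy of the pipe
output, _ = second.communicate()
first.wait()
print(output, end="")
```

The two processes run at the same time and the data flows between them without being written to a file in between, which is exactly what `|` gives you on the command line.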

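In [11]:
import os

# Print about 10,000 rows of the final table, starting from the first row
# that begins with "a".
started = False
with open(os.path.join('data', 'skipgram.tsv'), encoding="utf-8") as f:
    printed = 0
    for line in f:
        if line.startswith("a"):
            started = True
        if started:
            print(line)  # `line` keeps its own newline, hence the blank lines below
            printed += 1
            if printed > 10000:
                break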

a	0	3626	305	256	294	318	80	295	1043	351	420	264

a-0	0	1	0	0	0	0	0	0	0	0	0	1

a"=0	0	2	0	0	0	0	1	0	0	1	0	0

a"/0	0	1	0	0	0	0	0	0	0	1	0	0

a	00	32	4	2	4	4	1	3	1	5	7	1

a	0-0	5	0	1	0	0	0	2	2	0	0	0

a	0,0	3	0	1	0	1	0	0	0	0	1	0

a	0/0	2	1	0	1	0	0	0	0	0	0	0

a	0.0	27	2	3	4	7	0	2	0	2	6	1

a	0+0	1	0	0	0	0	0	0	0	0	0	1

a0	0	1	0	0	1	0	0	0	0	0	0	0

a	000	80	8	13	11	5	1	1	3	1	21	16

a	0,0,0	1	0	0	0	0	0	0	0	0	0	1

a	0:00	1	0	0	0	0	1	0	0	0	0	0

a	0.00	6	1	0	0	1	0	1	0	2	0	1

a	0000	13	3	1	1	3	1	0	1	1	1	1

a	0.000	3	1	1	0	0	0	0	0	1	0	0

a	0.0.0.0	3	1	0	0	0	0	0	0	0	1	1

a	00:00	3	1	1	0	0	0	0	0	0	0	1

a	00000	1	0	0	0	0	0	0	0	0	0	1

a	000000	1	1	0	0	0	0	0	0	0	0	0

a	00000000-0000-0000-0000-000000000000	1	0	1	0	0	0	0	0	0	0	0

a	0.000000065	2	0	1	0	0	0	0	1	0	0	0

a	0.00000026	1	0	1	0	0	0	0	0	0	0	0

a	0.0000005	1	1	0	0	0	0	0	0	0	0	0

a000:0000	address	1	0	0	0	1	0	0	0	0	0	0

a000:0000	and	1	0	0	0	0	0	1	0	0	0	0

a000:0000	b000:ffff	1	0	0	0	0	0	0	1	0	0	0

a000:0000	in	1	0	0	0	0	0	0	0	1	0	0

a000:0000	mode	1	

a	1.0016	1	0	0	0	0	0	0	0	0	0	1

a	100180	1	0	0	1	0	0	0	0	0	0	0

a	100,182	2	0	0	1	0	0	0	0	1	0	0

a	100,189	1	1	0	0	0	0	0	0	0	0	0

a	10,019	1	0	1	0	0	0	0	0	0	0	0

a-100	1959	1	0	1	0	0	0	0	0	0	0	0

a100	1963	1	0	0	0	1	0	0	0	0	0	0

a-100	1965	1	0	0	0	0	0	0	0	1	0	0

a1001	a1057	1	0	0	1	0	0	0	0	0	0	0

a1001	and	1	0	0	0	1	0	0	0	0	0	0

a1001	bishops	1	0	0	0	0	0	0	0	1	0	0

a1001	bridge	1	0	0	0	0	0	0	0	0	0	1

a1001	business	1	0	0	0	0	0	0	0	0	0	1

a1001	each	1	0	1	0	0	0	0	0	0	0	0

a1001	near	1	0	0	0	0	0	0	1	0	0	0

a1001	of	1	1	0	0	0	0	0	0	0	0	0

a1001	on	2	0	0	0	1	0	1	0	0	0	0

a1001	other	1	0	0	1	0	0	0	0	0	0	0

a	1,001-piano	1	0	0	0	0	0	1	0	0	0	0

a1001	polaroid	1	0	0	0	0	0	1	0	0	0	0

a1001	railway	1	0	0	0	0	0	0	0	0	1	0

a1001	square	1	0	0	0	0	0	0	0	0	1	0

a1001	the	5	0	1	0	0	2	0	1	1	0	0

a1001	to	1	1	0	0	0	0	0	0	0	0	0

a	1002	18	1	4	4	3	1	0	0	3	2	0

a	1,002	4	2	0	0	0	0	0	0	1	1	0

a	1.002	1	0	1	0	0	0	0	0	0	0	0

a	10:02	1	0	0	0	1	0	0	0	0	0	0

a	10.02	2	0	0	0	0	0	0	0	2	0	0

a	100.2	4	0	0	0	0	0	0	0


a	109,371	1	0	0	0	0	0	0	0	0	1	0

a	109,375	2	2	0	0	0	0	0	0	0	0	0

a	109,376	2	0	0	1	0	0	0	0	1	0	0

a	1094	9	1	2	2	0	1	0	0	1	0	2

a	1,094	5	1	0	0	0	0	0	0	4	0	0

a	109.4	1	1	0	0	0	0	0	0	0	0	0

a	10,941	1	0	0	0	0	0	0	0	1	0	0

a	109,410	2	0	0	1	0	0	0	0	1	0	0

a	10,942	1	0	0	0	0	0	0	0	1	0	0

a	109,424	1	0	0	0	0	0	0	0	0	1	0

a	109,428	2	0	0	1	0	0	0	0	1	0	0

a	109.443782	1	0	0	0	1	0	0	0	0	0	0

a	109,459	2	0	0	1	0	0	0	0	1	0	0

a	10,947	1	0	0	0	0	0	0	0	0	0	1

a	109,472	1	0	0	0	0	0	0	0	0	0	1

a	1094/95	1	0	0	0	0	1	0	0	0	0	0

a	109,496	2	0	0	1	0	0	0	0	1	0	0

a1094	b1122	1	0	0	0	0	0	0	0	1	0	0

a1094	benhall	1	0	0	1	0	0	0	0	0	0	0

a1094	by	1	0	0	0	1	0	0	0	0	0	0

a1094	in	1	0	1	0	0	0	0	0	0	0	0

a1094	leads	1	0	0	0	0	0	0	0	0	1	0

a	109.4-meter	1	0	0	0	0	0	0	1	0	0	0

a1094	road	1	0	0	0	0	0	1	0	0	0	0

a	1,094-seat	1	0	0	0	0	0	1	0	0	0	0

a1094	street	1	1	0	0	0	0	0	0	0	0	0

a1094	the	2	0	0	0	0	1	0	1	0	0	0

a1094	to	1	0	0	0	0	0	0	0	0	0	1

a	1095	27	3	7	5	3	4	0	0	1	2	2

a	1,095	2	0	1	0	0	0	0	0	1	0	0

a	1:

a1152	near	1	0	0	0	0	0	1	0	0	0	0

a1152	off	1	0	0	0	1	0	0	0	0	0	0

a1152	raf	1	0	0	0	0	0	0	0	0	1	0

a1152	rendlesham	1	0	1	0	0	0	0	0	0	0	0

a1152	the	2	0	0	0	0	1	0	1	0	0	0

a	1153	25	4	5	4	7	2	0	0	3	0	0

a	1,153	5	0	1	0	0	0	0	0	4	0	0

a	11.53	1	1	0	0	0	0	0	0	0	0	0

a	115.3	1	0	0	0	0	0	0	1	0	0	0

a	11530	3	1	0	0	0	1	0	0	0	1	0

a	115,331	1	0	0	0	0	0	0	0	1	0	0

a	115,341	1	1	0	0	0	0	0	0	0	0	0

a	115,382	1	1	0	0	0	0	0	0	0	0	0

a	115,385	2	0	0	1	0	0	0	0	1	0	0

a	1154	29	6	3	3	6	4	0	0	3	4	0

a	1,154	4	0	1	1	1	0	0	0	1	0	0

a	1.1.5.4	2	0	0	0	0	1	0	0	0	1	0

a	11,541	1	0	0	0	0	0	0	0	1	0	0

a	1,154,116	1	0	0	0	0	0	0	0	1	0	0

a	115,439	2	0	0	1	0	0	0	0	1	0	0

a	11,544	1	0	0	1	0	0	0	0	0	0	0

a	11,546	1	0	1	0	0	0	0	0	0	0	0

a	1155	21	3	3	6	5	0	0	0	0	2	2

a	1,155	5	1	0	1	0	0	0	0	2	0	1

a	11550	1	0	0	0	0	0	0	0	1	0	0

a	115,500,000	1	0	0	0	0	0	1	0	0	0	0

a	115,513	1	0	0	1	0	0	0	0	0	0	0

a	11553	2	0	0	0	1	0	0	0	1	0	0

a	11,553	1	0	0	1	0	0	0	0	0	0	0

a	11,553,427	1	0	0	0	0	0	0	0	1	0	0

a	115,541	2	0	0	1	0