
**Natural Language Processing - Lab #1 and #2: Unix Tools and Regular Expressions**

<p align = 'justify'>This lab covers regular expressions (regex) and a set of Unix tools for quick text processing. Section III below has a set of questions. For each one, include the commands you ran and their output, copied and pasted from the terminal.</p>


**II. Before Starting**

A. The United Nations Corpus

<p align = 'justify'>In this assignment, you will use the United Nations Corpus (UNCorpus), a corpus of UN General Assembly resolutions. The UNCorpus is a six-language parallel text in Arabic, Chinese, English, French, Russian and Spanish. The following paper describes the corpus:</p>

<p align = 'justify'>Alexandre Rafalovitch and Robert Dale. 2009. United Nations General Assembly Resolutions: A Six-Language Parallel Corpus. In Proceedings of the MT Summit XII, pages 292-299, Ottawa, Canada.</p>


In [None]:
!wget https://web.archive.org/web/20180831123202/http:/www.uncorpora.org/files/uncorpora_plain_20090831.zip

--2021-04-06 21:31:15--  https://web.archive.org/web/20180831123202/http:/www.uncorpora.org/files/uncorpora_plain_20090831.zip
Resolving web.archive.org (web.archive.org)... 207.241.237.3
Connecting to web.archive.org (web.archive.org)|207.241.237.3|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘uncorpora_plain_20090831.zip’

uncorpora_plain_200     [           <=>      ]  40.93M  1.30MB/s    in 45s     

2021-04-06 21:32:02 (928 KB/s) - ‘uncorpora_plain_20090831.zip’ saved [42916526]



<p align = 'justify'>Unzipping the file produces a text file named uncorpora_plain_20090831.tmx. This file will be referred to as the UNCorpus in the rest of this document.</p>

In [None]:
!unzip /content/uncorpora_plain_20090831.zip

Archive:  /content/uncorpora_plain_20090831.zip
  inflating: uncorpora_plain_20090831.tmx  


**Unix Tools**

<p align = 'justify'>Revise the usage of the following Unix commands (and some of their specific options), which you will need in this assignment: cat, wc, sort (sort -nr), uniq (uniq -c), grep (grep -e; grep -A), comm, and more (the pager command, that is). You can use the man command to check the usage from any Unix terminal (e.g., man cat). You can also check this online man page:</p>


http://unixhelp.ed.ac.uk/alphabetical/. 

Other Unix commands you may want to consider checking are: less, tr and sed. 

Additionally, revise the use of the pipeline and I/O redirections (| and >, specifically). 

For a quick introduction, see http://www.westwind.com/reference/os-x/commandline/pipes.html.

Finally, we recommend using the Perl interpreter in Unix pipeline mode to apply regex substitutions: perl -pe '<substitute-regex>;' (or python -c <stuff>). It is much more powerful than the sed or tr commands.

In [None]:
!cat uncorpora_plain_20090831.tmx |perl -pe 'tr/A-Z/a-z/;'|more

<?xml version="1.0" encoding="utf-8"?>
<tmx version="1.4b">
  <header segtype="paragraph" creationtoolversion="1.0" srclang="en" creationtool="oresaligner" datatype="plaintext" o-tmf="ores" adminlang="en-us"/>
  <body>
    <tu tuid="55_100:6">
      <prop type="session">55</prop>
      <prop type="committee">3</prop>
      <tuv xml:lang="en">
        <seg>resolution 55/100</seg>
      </tuv>
      <tuv xml:lang="ar">
        <seg>القرار 55/100</seg>
      </tuv>
      <tuv xml:lang="zh">
        <seg>第55/100号决议</seg>
      </tuv>
      <tuv xml:lang="fr">
        <seg>rÉsolution 55/100</seg>
      </tuv>
      <tuv xml:lang="ru">
        <seg>РЕЗОЛЮЦИЯ 55/100</seg>
      </tuv>


**Regular Expressions**

<p align = 'justify'>Revise the regular expression definitions in Chapter 2 of the J+M book (Jurafsky and Martin, Speech and Language Processing). There is a cheat sheet inside the cover of the book. Another cheat sheet is available here:</p> 

http://web.mit.edu/hackl/www/lab/turkshop/slides/regex-cheatsheet.pdf
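Two regex constructs appear in the commands later in this lab: the word boundary `\b` and the backreference `\1`. The sketch below tries them on a hypothetical word list (`words.txt` is an illustrative name). It assumes GNU grep, where `\b` and backreferences in `-E` mode are supported as extensions.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample: one entry per line.
printf 'and\nband\nsand and sea\nbook\nIII\n' > words.txt

# Plain substring match: every line that contains the letters "and".
substring_hits=$(grep -c 'and' words.txt)

# \b marks a word boundary, so only lines with the standalone word "and" match.
word_hits=$(grep -c '\band\b' words.txt)

# (.)\1 matches any character immediately followed by a copy of itself.
doubled=$(grep -E '(.)\1' words.txt)

echo "substring: $substring_hits, whole word: $word_hits"
echo "$doubled"
```

The gap between the substring count and the whole-word count is the same effect seen when comparing `grep "and"` with `grep "\band\b"` on the full corpus.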

**Questions**

#### Q1: The Full UNCorpus 

<p align = 'justify'>Answer the following questions using Unix commands and regex only. Each question should be answered with one command line (possibly consisting of multiple piped Unix commands).</p>

1. How many lines does the UNCorpus file have?
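No command is recorded for this question. A minimal sketch, shown on a hypothetical five-line stand-in (`sample.tmx`) rather than the real corpus, is to use `wc -l`. Every line either contains `<seg>` or does not, so the two grep counts should add up to the total. If the counts reported for questions 2 and 3 are right, the full file should have 434034 + 1067282 = 1501316 lines.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical five-line stand-in for uncorpora_plain_20090831.tmx.
printf '<tu>\n<tuv xml:lang="EN">\n<seg>Resolution</seg>\n</tuv>\n</tu>\n' > sample.tmx

# wc -l counts newline characters; reading from stdin keeps the file name out of the output.
total=$(wc -l < sample.tmx)

# Cross-check: lines with <seg> plus lines without it must add up to the total.
with_seg=$(grep -c '<seg>' sample.tmx)
without_seg=$(grep -vc '<seg>' sample.tmx)
echo "$total lines = $with_seg with <seg> + $without_seg without"
```

On the real file, the one-line answer would be `wc -l uncorpora_plain_20090831.tmx`.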

2. How many segments <seg>?

In [None]:
!grep '<seg>' uncorpora_plain_20090831.tmx | wc -l

434034


3. How many non-segment lines are there, i.e., lines whose tags are not <seg>, such as <tuv>?

In [None]:
!grep -v '<seg>' uncorpora_plain_20090831.tmx | wc -l

1067282


3a) What percentage of the file size is text vs XML?

4. How many English segments does the text have?

In [None]:
!cat uncorpora_plain_20090831.tmx |grep "xml:lang=\"EN\"" |wc -l

72339


5. How many segments exist for each language (Chinese, Arabic,...)? (again, done in one command)


In [None]:
!cat uncorpora_plain_20090831.tmx |grep "xml:lang=\"..\"" |sort |uniq -c|sort -nr

  72339       <tuv xml:lang="ZH">
  72339       <tuv xml:lang="RU">
  72339       <tuv xml:lang="FR">
  72339       <tuv xml:lang="ES">
  72339       <tuv xml:lang="EN">
  72339       <tuv xml:lang="AR">


#### Q2: The English UNCorpus

<p align = 'justify'>Answer the following questions using Unix commands and regex only. Each question should be answered with one command line (possibly consisting of multiple piped Unix commands).</p>

1. Extract the text without XML for only the English segments and put it in a file called "uncorpus.eng.txt" (Hint: use "grep -A1"). The rest of the questions are about this file. How would you verify that you did not miss any lines?


In [None]:
!cat uncorpora_plain_20090831.tmx |grep "\band\b"|wc

  49036 2327136 16607612


In [None]:
!cat uncorpora_plain_20090831.tmx |grep "and"|wc

  86480 4456661 31323758


In [None]:
!grep -A1 "lang=\"EN\"" uncorpora_plain_20090831.tmx |grep "<seg>"

Streaming output truncated to the last 5000 lines.
        <seg>The least developed countries: Afghanistan, Angola, Bangladesh, Benin, Bhutan, Burkina Faso, Burundi, Cambodia, Cape Verde, Central African Republic, Chad, Comoros, Democratic Republic of the Congo, Djibouti, Equatorial Guinea, Eritrea, Ethiopia, Gambia, Guinea, Guinea-Bissau, Haiti, Kiribati, Lao People's Democratic Republic, Lesotho, Liberia, Madagascar, Malawi, Maldives, Mali, Mauritania, Mozambique, Myanmar, Nepal, Niger, Rwanda, Samoa, Sao Tome and Principe, Sierra Leone, Solomon Islands, Somalia, Sudan, Togo, Tuvalu, Uganda, United Republic of Tanzania, Vanuatu, Yemen, Zambia</seg>
        <seg>RESOLUTION 55/236</seg>
        <seg>Adopted at the 89th plenary meeting, on 23 December 2000, without a vote, on the recommendation of the Committee (A/55/712, para. 10)</seg>
        <seg>55/236. Voluntary movements in connection with the apportionment of the expenses of United Nations peacekeeping operations</

        <seg>2. Reaffirms once again that the existence of colonialism in any form or manifestation, including economic exploitation, is incompatible with the Charter of the United Nations, the Declaration on the Granting of Independence to Colonial Countries and Peoples and the Universal Declaration of Human Rights;</seg>
        <seg>3. Reaffirms its determination to continue to take all steps necessary to bring about the complete and speedy eradication of colonialism and the faithful observance by all States of the relevant provisions of the Charter, the Declaration on the Granting of Independence to Colonial Countries and Peoples and the Universal Declaration of Human Rights;</seg>
        <seg>4. Affirms once again its support for the aspirations of the peoples under colonial rule to exercise their right to self-determination, including independence, in accordance with relevant resolutions of the United Nations on de


In [None]:
!grep -A1 "lang=\"EN\"" uncorpora_plain_20090831.tmx |grep "<seg>" |perl -pe 's/\s*<\/?seg>//g;'|wc

  72339 2685538 18008957


In [None]:
!cat uncorpora_plain_20090831.tmx |perl -pe 's/ /\n/g;'|grep -v "[0-z]"|sort|uniq -c |sort -nr

Streaming output truncated to the last 5000 lines.
      1 المعنوي،
      1 المعنون"حقوق
      1 المعنون"حالة
      1 المعنونتان
      1 المعنــــــون
      1 المـعنون
      1 المعمدانية
      1 المعلومة،
      1 المعلومات()،وتوافق
      1 المعلومات؛
      1 المعلومات"
      1 المعلومــات،
      1 المعلَّقة
      1 المعلّقة
      1 المعقولة،
      1 المعقولة"
      1 المعقــــــودة
      1 المعقــــودة
      1 المعقـــــود
      1 المعقــــود
      1 المعقـود
      1 المعـقـود
      1 المعـقود
      1 المـعـقـود
      1 (المعقود
      1 المعقمة،
      1 المعقدة"()،
      1 المعقدة''()
      1 المعززة،
      1 المعزِّزة
      1 المعزز.
      1 المعروضـة
      1 المعــروض
      1 المعرقلة
      1 المعرفة"،
      1 المعرف
      1 المعرّضين
      1 المعـرَّضــة
      1 المعديـــة،
      1 المعدنية
      1 المعدِّل
      1 المعدّة
      1 المعدّ
      1 المعجّلة
      1 المعجل،
      1 المعجَّل
      1 المعتمدون،
      1 المعتمدة.
      1 المعتمــدة
      1 المعتمدان
      1 ا

In [None]:
!cat uncorpora_plain_20090831.tmx |perl -pe 's/ /\n/g;'|grep -E "(.)\1"|wc

1309730 1309804 21042376


In [None]:
!cat uncorpora_plain_20090831.tmx  |perl -pe 's/ /\n/g;'|grep -E "(.)\1\1"|grep -E "[iIxXvVcCmMLl]"|wc

   4977    4977  123064
