# MIT's Missing Semester - Data Wrangling

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/rwcitek/MyBinder.demo/main?labpath=%2FRegular.Expressions%2Fmit.data-wrangling.ipynb)

This notebook covers the exercises from this lecture:

https://missing.csail.mit.edu/2020/data-wrangling/

In particular, see the [Exercises](https://missing.csail.mit.edu/2020/data-wrangling/#:~:text=Exercises) section.

## Exercise 2

In [1]:
# This can fail on multibyte characters in the 'words' file; the next cell cleans the data first
cat /usr/share/dict/words |
rev |
cut -c1-2 |
rev |
wc -l

rev: stdin: Invalid or incomplete multibyte or wide character
10393
cut: write error: Broken pipe


In [2]:
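# Clean up the data, as the 'words' file may contain multibyte characters
export LC_ALL=C                           # export so that tr and grep also run in the C locale
cat /usr/share/dict/words |
tr -dc '\0-\177' |                        # delete every non-ASCII byte
grep -Ei "^[a-z']+$" > /tmp/words.clean   # keep lines made only of letters and apostrophes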
wc -l /usr/share/dict/words /tmp/words.clean 

  654749 /usr/share/dict/words
  654749 /tmp/words.clean
 1309498 total
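
A minimal sketch of why the line count can stay the same, using three made-up words and hypothetical file names under `/tmp`: `tr -dc` deletes the bytes outside the ASCII range but leaves every newline in place, so a line with an accented letter gets shorter rather than disappearing.

```shell
# Hypothetical sample: "café" (é written as the two UTF-8 bytes \303\251), "naive", "banana"
printf 'caf\303\251\nnaive\nbanana\n' > /tmp/sample.words

# Delete every byte outside the ASCII range 0-127 (octal \0-\177)
LC_ALL=C tr -dc '\0-\177' < /tmp/sample.words > /tmp/sample.clean

wc -l < /tmp/sample.words   # line count before cleaning
wc -l < /tmp/sample.clean   # line count after cleaning: unchanged
cat /tmp/sample.clean       # "café" has become "caf"
```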


In [3]:
# No more error, yet both files have the same line count: tr deletes non-ASCII bytes, not whole lines
cat /tmp/words.clean |
rev |
cut -c1-2 |
rev |
wc -l

654749


In [4]:
cat /tmp/words.clean |
tr 'A' 'a' |              # convert all uppercase A's to lowercase a's
grep 'a.*a.*a' |          # keep words with at least three a's, not necessarily in a row
grep -v "'s$" |           # exclude words ending in 's
rev |                     # reverse the characters in each word ( i.e. line )
cut -c1-2 |               # keep the first two characters, i.e. the word's last two, reversed
rev |                     # reverse again to restore the original order
sort |                    # sort in alphabetical order
uniq -c |                 # count occurrences of each distinct line
sort -rn |                # reverse numerical sort, largest count first
head -3                   # display the top three

   1279 ia
   1273 al
   1185 an
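
The `rev | cut -c1-2 | rev` idiom can be checked on a few inline words (made up for illustration). Reversing puts the last letters first, `cut` keeps two of them, and the second `rev` restores their order.

```shell
# Hypothetical sample words, one per line
printf 'banana\nalabama\ncanada\n' |
rev |           # ananab, amabala, adanac
cut -c1-2 |     # an, am, ad
rev             # na, ma, da
```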


In [5]:
cat /tmp/words.clean |
tr 'A' 'a' |              # convert all uppercase A's to lowercase a's
grep 'a.*a.*a' |          # keep words with at least three a's, not necessarily in a row
grep -v "'s$" |           # exclude words ending in 's
sed -re 's/^.*(..)$/\1/'| # extract the last two letters of each word ( i.e. line )
sort |                    # sort in alphabetical order
uniq -c |                 # count occurrences of each distinct line
sort -rn |                # reverse numerical sort, largest count first
head -3                   # display the top three

   1279 ia
   1273 al
   1185 an
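
The counting tail of the pipeline, `sort | uniq -c | sort -rn | head`, can be tried on its own. A small sketch with made-up two-letter endings: `uniq -c` only merges adjacent duplicates, which is why the first `sort` is needed.

```shell
# Hypothetical list of endings: na appears 3 times, ma twice, da once
printf 'na\nma\nna\nda\nna\nma\n' |
sort |        # group identical lines together
uniq -c |     # prefix each distinct line with its count
sort -rn |    # largest count first
head -2       # keep the top two: na (3), then ma (2)
```

The exact padding before each count depends on the `uniq` implementation.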


In [6]:
cat /tmp/words.clean |
grep -i 'a.*a.*a' |       # case-insensitive search for at least three a's, not necessarily in a row
grep -i -v "'s$" |        # exclude words ending in 's
sed -re 's/^.*(..)$/\1/'| # extract the last two letters of each word ( i.e. line )
sort |                    # sort in alphabetical order
uniq -c |                 # count occurrences of each distinct line
sort -rn |                # reverse numerical sort, largest count first
head -3                   # display the top three

   1279 ia
   1273 al
   1185 an
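
The difference between the two ways of handling case can be seen on a single made-up word: `tr` rewrites the text before it is matched, while `grep -i` matches case-insensitively and prints the line unchanged.

```shell
# Hypothetical sample word with an uppercase A
echo 'Alabama' | tr 'A' 'a' | grep 'a.*a.*a'   # prints alabama (text modified)
echo 'Alabama' | grep -i 'a.*a.*a'             # prints Alabama (text untouched)
```

The two approaches can only give different endings for words whose last two letters include an uppercase A.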


In [7]:
cat /tmp/words.clean |
grep -i 'a.*a.*a' |       # case-insensitive search for at least three a's, not necessarily in a row
grep -i -v "'s$" |        # exclude words ending in 's
sed -E 's/^.*(..)$/\1/'|  # extract the last two letters of each word ( i.e. line )
sort |                    # sort in alphabetical order
uniq -c |                 # count occurrences of each distinct line
sort -rn |                # reverse numerical sort, largest count first
head -3                   # display the top three

   1279 ia
   1273 al
   1185 an
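
A quick check of the `sed` expression on a few made-up words. `-r` is the GNU spelling of the extended-regex option, while `-E` is accepted by both GNU and BSD `sed`. The greedy `^.*` consumes everything except the final two characters captured by `(..)`; a line shorter than two characters does not match and passes through unchanged.

```shell
# Hypothetical sample, including a one-letter word
printf 'banana\nalabama\na\n' |
sed -E 's/^.*(..)$/\1/'   # prints na, ma, then a (too short to match)
```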
