# MIT's Missing Semester - Data Wrangling

This notebook works through the exercises from the Data Wrangling lecture:

https://missing.csail.mit.edu/2020/data-wrangling/

In particular, it covers the [Exercises](https://missing.csail.mit.edu/2020/data-wrangling/#:~:text=Exercises) section.

## Exercise 2

### Set up

In [None]:
# This can fail with an error (the 'words' file may contain multi-byte characters); the next cell cleans the data first
cat /usr/share/dict/words |
rev |
cut -c1-2 |
rev |
wc -l

In [None]:
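# Clean up the data, as the 'words' file may contain multi-byte (non-ASCII) characters
cat /usr/share/dict/words |
LC_ALL=C tr -dc '\0-\177' |                           # delete every non-ASCII byte (newlines are kept)
LC_ALL=C grep -E -i "^[a-z']+$" > /tmp/words.clean    # keep lines made only of letters and apostrophes
wc -l /usr/share/dict/words /tmp/words.clean          # compare line counts before and after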

In [None]:
# No more error, yet wc -l reported the same line count for both files ... odd.
cat /tmp/words.clean |
rev |
cut -c1-2 |
rev |
wc -l
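
One possible explanation for the matching line counts, sketched on a hypothetical three-word sample rather than the real dictionary: `tr -dc` deletes bytes, not lines, so a word with an accented letter is shortened instead of dropped, and the `grep` filter only removes a line if something other than letters or apostrophes is left. The sample words and the variable names `sample` and `cleaned` are made up for illustration.

```shell
# Hypothetical sample; the second word ends in a two-byte UTF-8 'é' (\303\251).
sample=$(printf "cafe\ncaf\303\251\nit's\n")

# Same cleanup as above, applied to the sample instead of /usr/share/dict/words.
cleaned=$(printf '%s\n' "$sample" |
  LC_ALL=C tr -dc '\0-\177' |
  LC_ALL=C grep -E -i "^[a-z']+$")

printf '%s\n' "$sample"  | wc -l   # lines before cleaning
printf '%s\n' "$cleaned" | wc -l   # lines after cleaning
printf '%s\n' "$cleaned"           # the accented word survives as 'caf'
```

If the real file behaves like this sample, the cleaned file keeps every line and only loses the non-ASCII bytes, which may be worth checking with `wc -c` instead of `wc -l`.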

### Question 1

In [None]:
cat /tmp/words.clean |
tr 'A' 'a' |              # convert all uppercase A's to lowercase a's
grep 'a.*a.*a' |          # search for three a's, not necessarily in a row
grep -v "'s$" |           # exclude words ending in 's
rev |                     # reverse characters in each word ( i.e. line )
cut -c1-2 |               # keep the first two characters ( i.e. the word's last two letters )
rev |                     # reverse characters in each line again to restore their order
sort |                    # sort in alphabetical order
uniq -c |                 # count unique lines
sort -rn |                # inverse numerical sort
head -3                   # display top three

In [None]:
cat /tmp/words.clean |
tr 'A' 'a' |              # convert all uppercase A's to lowercase a's
grep 'a.*a.*a' |          # search for three a's, not necessarily in a row
grep -v "'s$" |           # exclude words ending in 's
sed -re 's/^.*(..)$/\1/'| # extract last two letters from a word ( i.e. a line ) 
sort |                    # sort in alphabetical order
uniq -c |                 # count unique lines
sort -rn |                # inverse numerical sort
head -3                   # display top three

In [None]:
cat /tmp/words.clean |
grep -i 'a.*a.*a' |       # case insensitive search for three a's, not necessarily in a row
grep -i -v "'s$" |        # exclude words ending in 's
sed -re 's/^.*(..)$/\1/'| # extract last two letters from a word ( i.e. a line ) 
sort |                    # sort in alphabetical order
uniq -c |                 # count unique lines
sort -rn |                # inverse numerical sort
head -3                   # display top three

In [None]:
cat /tmp/words.clean |
grep -i 'a.*a.*a' |       # case insensitive search for three a's, not necessarily in a row
grep -i -v "'s$" |        # exclude words ending in 's
sed -E 's/^.*(..)$/\1/'|  # extract last two letters from a word ( i.e. a line ) 
sort |                    # sort in alphabetical order
uniq -c |                 # count unique lines
sort -rn |                # inverse numerical sort
head -3                   # display top three
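
As a sanity check, the same pipeline can be run on a short word list whose answer is easy to count by hand. This is only a sketch: the eight sample words and the function name `top_endings` are hypothetical, and an extra `awk` step is added to strip the padding that `uniq -c` puts in front of each count.

```shell
# Wrap the Question 1 pipeline in a function that reads words on stdin.
top_endings() {
  grep -i 'a.*a.*a' |          # at least three a's, case insensitive
  grep -v "'s$" |              # exclude words ending in 's
  sed -E 's/^.*(..)$/\1/' |    # keep the last two letters
  sort |
  uniq -c |
  sort -rn |
  awk '{print $1, $2}' |       # normalise the padding added by uniq -c
  head -3
}

# Hypothetical sample: "area" has only two a's and "Canada's" ends in 's.
printf '%s\n' banana cabana bandana Alabama panama bananas "Canada's" area |
  top_endings
```

By hand, `na` ends three of the remaining words, `ma` two and `as` one, which is what the function should report.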

### Question 2

In [None]:
cat /tmp/words.clean |
grep -i 'a.*a.*a' |       # case insensitive search for three a's, not necessarily in a row
grep -i -v "'s$" |        # exclude words ending in 's
sed -E 's/^.*(..)$/\1/'|  # extract last two letters from a word ( i.e. a line ) 
sort |                    # sort in alphabetical order
uniq > /tmp/letter.pairs  # list of unique letter pairs
wc -l /tmp/letter.pairs   # count the number of unique pairs


### Question 3 -- challenge

In [None]:
head -3 /tmp/letter.pairs

In [None]:
{
echo {a..z}{a..z} |       # generate all possible two-letter combinations
tr ' ' '\n'               # replace each space with a newline
cat /tmp/letter.pairs     # append the pairs that were actually seen
} |
sort |                    # sort so that duplicates are adjacent
uniq -u |                 # keep only pairs that appear exactly once ( the ones missing from letter.pairs )
wc -l                     # count number of pairs
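
Note that `uniq -u` also counts any line that appears only in `/tmp/letter.pairs`, for example a pair containing an uppercase letter or an apostrophe. A hedged alternative is `comm`, which compares two sorted files and can print only the lines unique to the first one. The sketch below uses a hypothetical three-line stand-in for `/tmp/letter.pairs` and a portable loop in place of brace expansion; the variable names are made up.

```shell
# Hypothetical stand-in for /tmp/letter.pairs: three endings that were seen.
seen=$(mktemp)
all=$(mktemp)
printf '%s\n' as ma na > "$seen"

# Every two-letter combination aa..zz, one per line, sorted in the C locale.
letters='a b c d e f g h i j k l m n o p q r s t u v w x y z'
for x in $letters; do
  for y in $letters; do
    echo "$x$y"
  done
done | LC_ALL=C sort > "$all"

# comm -23 prints lines found only in the first file: pairs that never occur.
missing=$(LC_ALL=C comm -23 "$all" "$seen" | wc -l | tr -d ' ')
echo "$missing"

rm -f "$seen" "$all"
```

With the real data, the second argument would be `/tmp/letter.pairs`, re-sorted with the same `LC_ALL=C sort` first because `comm` expects both inputs in the same order.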

## Exercise 3

We'll create a smaller version of the words file that we can work on.

In [None]:
# Create a file containing the first 100 words
head -100 /tmp/words.clean > /tmp/words.clean.100
wc -l /tmp/words.clean.100


In [None]:
# Show first 10 lines
head /tmp/words.clean.100


In [None]:
# Replace the first 'A' on each line with '-' ( output only; the file itself is unchanged )
sed 's/A/-/' /tmp/words.clean.100 | head


In [None]:
# Replace every 'A' on each line using sed's 'g' ( global ) flag; the file itself is still unchanged
sed 's/A/-/g' /tmp/words.clean.100 | head
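
The difference between the two substitutions is easiest to see on a single word with more than one `A`. A minimal sketch, using the hypothetical input `ABBA` and made-up variable names:

```shell
# Without 'g', only the first match on the line is replaced.
first_only=$(printf 'ABBA\n' | sed 's/A/-/')

# With 'g', every match on the line is replaced.
every_one=$(printf 'ABBA\n' | sed 's/A/-/g')

echo "$first_only"
echo "$every_one"
```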


In [None]:
# Edit the file in place with -i ( GNU sed syntax; BSD/macOS sed needs -i '' ), showing it before and after
head /tmp/words.clean.100
echo '===='
sed -i 's/A/-/g' /tmp/words.clean.100
head /tmp/words.clean.100
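
Because an in-place edit overwrites the file, it can be safer to keep a backup. A minimal sketch, assuming a scratch file made with `mktemp` and three hypothetical words so that `/tmp/words.clean.100` is left alone: attaching a suffix to `-i` makes sed save the original under that suffix.

```shell
# Hypothetical scratch file standing in for /tmp/words.clean.100.
scratch=$(mktemp)
printf '%s\n' Aachen ABBA banana > "$scratch"

# -i.bak edits in place and keeps the original as "$scratch.bak".
sed -i.bak 's/A/-/g' "$scratch"

edited=$(cat "$scratch")
backup=$(cat "$scratch.bak")
printf '%s\n' "$edited" '====' "$backup"

rm -f "$scratch" "$scratch.bak"
```

The attached-suffix form `-i.bak` is accepted by both GNU and BSD sed, so it also sidesteps the difference in how the two handle a bare `-i`.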
