# UNIX Commands for Data Scientists

## Declare Filename

In [1]:
!ls ./unix

shakespeare.txt


In [2]:
filename = './unix/shakespeare.txt'
!echo $filename

./unix/shakespeare.txt


## head

In [3]:
!head -n 3 $filename

This is the 100th Etext file presented by Project Gutenberg, and
is presented in cooperation with World Library, Inc., from their
Library of the Future and Shakespeare CDROMS.  Project Gutenberg


## tail

In [4]:
!tail -n 10 $filename

PERSONAL USE ONLY, AND (2) ARE NOT DISTRIBUTED OR USED
COMMERCIALLY.  PROHIBITED COMMERCIAL DISTRIBUTION INCLUDES BY ANY
SERVICE THAT CHARGES FOR DOWNLOAD TIME OR FOR MEMBERSHIP.>>



End of this Etext of The Complete Works of William Shakespeare





## wc

In [5]:
!wc $filename

  124505  901447 5583442 ./unix/shakespeare.txt


In [6]:
!wc -l $filename

  124505 ./unix/shakespeare.txt
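
As a rough cross-check of what the three `wc` columns mean, the counts can be reproduced in Python. This is only a sketch: the helper name `wc` and the sample string are made up for illustration, and the real `wc` has locale-dependent details that are ignored here.

```python
def wc(text):
    """Return (lines, words, bytes), roughly as `wc` reports them."""
    data = text.encode("utf-8")
    lines = data.count(b"\n")   # wc -l counts newline characters
    words = len(text.split())   # whitespace-separated tokens
    return lines, words, len(data)

# Hypothetical two-line sample, not taken from shakespeare.txt
sample = "To be, or not to be,\nthat is the question.\n"
print(wc(sample))
```

Because line counting is really newline counting, a file whose last line lacks a trailing newline reports one line fewer than an editor would show.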


## cat

In [7]:
!cat $filename | wc -l 

  124505


## grep

In [8]:
!grep -i 'parchment' $filename

  If the skin were parchment, and the blows you gave were ink,
  Ham. Is not parchment made of sheepskins?
    of the skin of an innocent lamb should be made parchment? That
    parchment, being scribbl'd o'er, should undo a man? Some say the
    Upon a parchment, and against this fire
    But here's a parchment with the seal of Caesar;  
    With inky blots and rotten parchment bonds;
    Nor brass, nor stone, nor parchment, bears not one,


In [9]:
# -o prints each match on its own line, so wc -l counts occurrences rather than matching lines

!cat $filename | grep -o 'liberty' | wc -l

      71
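
The difference between counting matching lines and counting occurrences is easy to miss. The sketch below illustrates it on made-up sample lines (they are not taken from the text), assuming a plain case-sensitive match as in the cell above.

```python
import re

# Hypothetical sample lines for illustration only
lines = [
    "liberty and liberty again",
    "no match here",
    "give me liberty",
]

# Like `grep -c 'liberty'`: number of lines containing at least one match
matching_lines = sum(1 for line in lines if "liberty" in line)

# Like `grep -o 'liberty' | wc -l`: total number of matches
occurrences = sum(len(re.findall("liberty", line)) for line in lines)

print(matching_lines, occurrences)
```

The two numbers differ whenever a line contains the pattern more than once.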


## sed

In [10]:
# replace all instances of 'parchment' with 'manuscript' (case-sensitive) and write the result to temp.txt

!sed -e 's/parchment/manuscript/g' $filename > temp.txt

In [11]:
!grep -i 'manuscript' temp.txt 

  If the skin were manuscript, and the blows you gave were ink,
  Ham. Is not manuscript made of sheepskins?
    of the skin of an innocent lamb should be made manuscript? That
    manuscript, being scribbl'd o'er, should undo a man? Some say the
    Upon a manuscript, and against this fire
    But here's a manuscript with the seal of Caesar;  
    With inky blots and rotten manuscript bonds;
    Nor brass, nor stone, nor manuscript, bears not one,


## sort

In [12]:
# first five lines in their original order, for comparison with the sorted output
!head -n 5 $filename

This is the 100th Etext file presented by Project Gutenberg, and
is presented in cooperation with World Library, Inc., from their
Library of the Future and Shakespeare CDROMS.  Project Gutenberg
often releases Etexts that are NOT placed in the Public Domain!!



In [13]:
!head -n 5 $filename | sort


Library of the Future and Shakespeare CDROMS.  Project Gutenberg
This is the 100th Etext file presented by Project Gutenberg, and
is presented in cooperation with World Library, Inc., from their
often releases Etexts that are NOT placed in the Public Domain!!


In [14]:
# fields separated by ' ' (-t' '), sort key runs from field 2 to the end of the line (-k2), case-insensitive (-f)
!head -n 5 $filename | sort -f -t' ' -k2


This is the 100th Etext file presented by Project Gutenberg, and
Library of the Future and Shakespeare CDROMS.  Project Gutenberg
is presented in cooperation with World Library, Inc., from their
often releases Etexts that are NOT placed in the Public Domain!!
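
The two orderings above can be reproduced in Python to see what the sort key actually is. This is a sketch under simplifying assumptions: it compares raw code points and ignores locale collation, and the helper name `key_from_field_2` is made up for illustration.

```python
# The five lines returned by `head -n 5` (the fifth is empty)
lines = [
    "This is the 100th Etext file presented by Project Gutenberg, and",
    "is presented in cooperation with World Library, Inc., from their",
    "Library of the Future and Shakespeare CDROMS.  Project Gutenberg",
    "often releases Etexts that are NOT placed in the Public Domain!!",
    "",
]

def key_from_field_2(line):
    # Drop the first space-separated field (-t' ' -k2), then fold case (-f)
    fields = line.split(" ")
    return " ".join(fields[1:]).upper()

plain = sorted(lines)                             # like: sort
by_field_2 = sorted(lines, key=key_from_field_2)  # like: sort -f -t' ' -k2

for line in by_field_2:
    print(repr(line))
```

In the plain sort, uppercase letters come before lowercase ones, which is why "Library" and "This" precede "is" and "often".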


In [15]:
!sort $filename | wc -l

  124505


In [16]:
# uniq -u keeps only the lines that occur exactly once in the sorted input
# (plain uniq would instead keep one copy of every distinct line)

!sort $filename | uniq -u | wc -l

  110834
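
The distinction between "distinct lines" and "lines that are never repeated" can be sketched with `collections.Counter`. The sample lines are made up for illustration.

```python
from collections import Counter

# Hypothetical sample lines for illustration only
lines = ["Exit", "Exeunt", "Exit", "THE END", "Exit"]
counts = Counter(lines)

# Like `sort | uniq`: one copy of every distinct line
distinct = sorted(counts)

# Like `sort | uniq -u`: only the lines that occur exactly once
only_once = sorted(line for line, n in counts.items() if n == 1)

print(distinct)
print(only_once)
```

Sorting first matters for `uniq` because it only compares adjacent lines; `Counter` has no such requirement.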


# Let's bring it all together

## Count the most frequent words in the text in UNIX

In [None]:
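# \s and \n inside a sed expression are GNU extensions. On BSD/macOS sed they are most
# likely treated as the literal letters 's' and 'n', in which case the lines are not
# split into words and every 's' becomes 'n' (note "Shakenpeare" in the output below).
# A portable way to put each word on its own line is: tr -s '[:space:]' '\n'
!sed -e 's/\s/\n/g' < $filename | sort | uniq -c | sort -nr | head -10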

9454 
 220 PROVIDED BY PROJECT GUTENBERG ETEXT OF ILLINOIS BENEDICTINE COLLEGE
 219 WITH PERMISSION.  ELECTRONIC AND MACHINE READABLE COPIES MAY BE
 219 SHAKESPEARE IS COPYRIGHT 1990-1993 BY WORLD LIBRARY, INC., AND IS
 219 SERVICE THAT CHARGES FOR DOWNLOAD TIME OR FOR MEMBERSHIP.>>
 219 PERSONAL USE ONLY, AND (2) ARE NOT DISTRIBUTED OR USED
 219 DISTRIBUTED SO LONG AS SUCH COPIES (1) ARE FOR YOUR OR OTHERS
 219 COMMERCIALLY.  PROHIBITED COMMERCIAL DISTRIBUTION INCLUDES BY ANY
 219 <<THIS ELECTRONIC VERSION OF THE COMPLETE WORKS OF WILLIAM
 161   
  93 SCENE II.
  74 SCENE III.
  56                                                          Exeunt.
  50 SCENE IV.
  44                                                           Exeunt
  39  Exit
  38 by William Shakenpeare
  37 THE END
  31 SCENE V.
  26 Scene II.
  22 Scene III.
  22 Exit.
  21 London. The palace
  21 ACT I. SCENE I.
  20 SCENE VI.
  20 ACT III. SCENE I.
  19 SCENE:
  19 Dramatin Pernonae
  19 ACT V. SCENE I.
  19 ACT IV. 
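
As a cross-check that does not depend on any particular `sed` implementation, the same word-frequency count can be sketched in Python. The helper name `top_words` and the sample string are assumptions for illustration; on the real file one would pass in the contents of `shakespeare.txt` instead.

```python
from collections import Counter

def top_words(text, n=10):
    # str.split() with no arguments splits on any run of whitespace,
    # similar in spirit to: tr -s '[:space:]' '\n'
    # Counter(...).most_common(n) plays the role of: sort | uniq -c | sort -nr | head -n
    return Counter(text.split()).most_common(n)

# Hypothetical sample text for illustration only
sample = "to be or not to be that is the question to"
print(top_words(sample, 2))
```

Unlike the shell pipeline, this version never produces an entry for the empty string, so there is no leading "count of spaces" row to discard.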

In [18]:
# head exits as soon as it has printed the requested number of lines, so the upstream 'sort'
# can fail with a "Broken pipe" error when it tries to write its remaining output

## Write the output to a file

In [20]:
!sed -e 's/\s/\n/g' < $filename | sort | uniq -c | sort -nr | head -10 > count_vs_words

sort: write failed: standard output: Broken pipe
sort: write error


In [None]:
!cat count_vs_words

## Plot by importing wordcounts into Python

In [None]:
%matplotlib inline

In [None]:
import matplotlib.pyplot as plt
import csv

xTicks = []
y = []

with open('count_vs_words','r') as csvfile:
    plots = csv.reader(csvfile, delimiter=' ')
    for row in plots:
        # uniq -c pads the count with leading spaces, so index from the end of the row
        y.append(int(row[-2]))
        xTicks.append(str(row[-1]))

# drop the first row, which holds the count of empty strings (whitespace)
y = y[1:]
xTicks = xTicks[1:]
# plot the counts against the word labels
x = range(len(y))
plt.figure(figsize=(10,10))
plt.xticks(x, xTicks, rotation=90)  # rotate the x tick labels by 90 degrees
plt.plot(x,y,'*')
plt.show()