# Getting Data + Basic Commands 


Information comes from Ch. 9 of *Data Science from Scratch*, 2nd Edition, by Joel Grus. This book is available for free through the library's connection to O'Reilly's learning platform.

Additional examples and information come from *Python Data Science Handbook* by Jake VanderPlas.

## Notebook Setup

The data files for this week are in `week5.files.zip`, which sits in the following directory structure:


```bash 
un5550-fa24/
    lec/
        lec.week2/
        ...
        lec.week5/ 
            nb.week5.ipynb
            week5.files.zip
    lab/
        lab01/
        lab02/ 
    ...
```


Now unzip the files using the `unzip` shell command (more on the `!` prefix later in this lesson).

This should expand to look like: 
```bash
un5550-fa24/
    lec/
        lec.week2/
        ...
        lec.week5/ 
            data/
                colon_delimited_stock_prices.txt
                dracula.txt
                ex1.csv
                ex2.csv
                ex3.csv
                H114.ord
                hurricanes.txt
                jane_eyre.txt
                nfl-passing-2018.csv
                rime-intro.txt
                romeo-juliet.txt
                sotu231.txt
                tab_delimited_stock_prices.txt
            egrep.py 
            line_count.py
            most_common_words.py 
            myscript.py
            nb.week5.ipynb
    lab/
        lab01/
        lab02/
        ...
```

In [1]:
!unzip week5.files.zip

Archive:  week5.files.zip
   creating: data/
  inflating: data/nfl-passing-2018.csv  
  inflating: data/tab_delimited_stock_prices.txt  
  inflating: data/colon_delimited_stock_prices.txt  
  inflating: data/rime-intro.txt     
  inflating: data/sotu231.txt        
  inflating: data/ex1.csv            
  inflating: data/ex3.csv            
 extracting: data/ex2.csv            
  inflating: data/README.md          
  inflating: data/dracula.txt        
  inflating: data/H114.ord           
  inflating: data/hurricanes.txt     
  inflating: data/romeo-juliet.txt   
  inflating: data/jane_eyre.txt      
  inflating: egrep.py                
  inflating: line_count.py           
  inflating: most_common_words.py    
  inflating: myscript.py             


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
%matplotlib inline

import csv 
import time 

## IPython Magic Commands 

*from Python Data Science Handbook, by Jake VanderPlas* 

Here we'll begin discussing some of the enhancements that IPython adds on top of the normal Python syntax. These are known in IPython as magic commands, and are prefixed by the % character. These magic commands are designed to succinctly solve various common problems in standard data analysis. 

Magic commands come in two flavors: 

* **line magics**, which are denoted by a single `%` prefix and operate on a single line of input, and
* **cell magics**, which are denoted by a double `%%` prefix and operate on multiple lines of input.

We'll demonstrate and discuss a few brief examples here, and come back to a more focused discussion of several useful magic commands.


### Running External Code: `%run` 

As you work on more extensive projects, you may find code available in external .py files.  This code can be run in your notebook with the `%run` magic. 

Here we have a file `myscript.py` with the following contents. 

```python
# myscript.py 

def square(x): 
	"""square a number"""
	return x ** 2 

for N in range(1,4): 
	print(N, " squared is ", square(N))
```

You can execute the script in the notebook as:

In [3]:
%run myscript.py

1  squared is  1
2  squared is  4
3  squared is  9


Note, after we've run the script, any functions defined within it are available for use in your notebook.

In [4]:
square(5)

25

### Timing Code Execution: `%timeit` 

Another useful magic function is `%timeit`, which measures the execution time of the **single-line** Python statement that follows it. For example, here we can examine the performance of a list comprehension:

In [5]:
%timeit L = [n ** 2 for n in range(100)]

18 µs ± 33.6 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


`%timeit` will automatically perform multiple runs to get robust results.  
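Outside a notebook, the standard-library `timeit` module gives a rough equivalent. The sketch below mirrors the 7 runs reported above; the loop count of 10,000 is an arbitrary choice for illustration, whereas `%timeit` picks it automatically.

```python
import statistics
import timeit

# Time the same list comprehension used above.
stmt = "L = [n ** 2 for n in range(100)]"

# repeat=7 runs, each executing the statement `number` times.
# number=10_000 is an arbitrary choice here; %timeit chooses it adaptively.
number = 10_000
run_totals = timeit.repeat(stmt, repeat=7, number=number)

# Convert each run's total time into a per-loop time in microseconds.
per_loop_us = [total / number * 1e6 for total in run_totals]

print(f"{statistics.mean(per_loop_us):.2f} µs ± "
      f"{statistics.stdev(per_loop_us):.2f} µs per loop "
      f"(mean ± std. dev. of {len(per_loop_us)} runs, {number:,} loops each)")
```

The printed numbers depend on the machine, so they will not match the notebook output exactly.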

To time multi-line statements, the `%%timeit` command is available.  Here we can time list construction in a for-loop: 

In [6]:
%%timeit 
L = []
for n in range(100): 
    L.append(n ** 2)


19.2 µs ± 95.4 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


IPython magic functions have docstrings, which can be accessed in the standard way.  For example, we can look at the documentation of the `%timeit` command: 

In [7]:
%timeit?

Docstring:
Time execution of a Python statement or expression

Usage, in line mode:
  %timeit [-n<N> -r<R> [-t|-c] -q -p<P> -o] statement
or in cell mode:
  %%timeit [-n<N> -r<R> [-t|-c] -q -p<P> -o] setup_code
  code
  code...

Time execution of a Python statement or expression using the timeit
module.  This function can be used both as a line and cell magic:

- In line mode you can time a single-line statement (though multiple
  ones can be chained with using semicolons).

- In cell mode, the statement in the first line is used as setup code
  (executed but not timed) and the body of the cell is timed.  The cell
  body has access to any variables created in the setup code.

Options:
-n<N>: execute the given statement <N> times in a loop. If <N> is not
provided, <N> is determined so as to get sufficient accuracy.

-r<R>: number of repeats <R>, each consisting of <N> loops, and take the
best result.
Default: 7

-t: use time.time to measure the time, which is the default on U

We can see a general description of available magic functions with the following: 

In [8]:
%magic


IPython's 'magic' functions

The magic function system provides a series of functions which allow you to
control the behavior of IPython itself, plus a lot of system-type
features. There are two kinds of magics, line-oriented and cell-oriented.

Line magics are prefixed with the % character and work much like OS
command-line calls: they get as an argument the rest of the line, where
arguments are passed without parentheses or quotes.  For example, this will
time the given statement::

        %timeit range(1000)

Cell magics are prefixed with a double %%, and they are functions that get as
an argument not only the rest of the line, but also the lines below it in a
separate argument.  These magics are called with two arguments: the rest of the
call line and the body of the cell, consisting of the lines below the first.
For example::

        %%timeit x = numpy.random.randn((100, 100))
        numpy.linalg.svd(x)

will time the execution of the numpy svd routine, running the assignment 

### A list of available magic functions 

Here is a quick and simple list of all available magic functions: 

In [9]:
%lsmagic 

Available line magics:
%alias  %alias_magic  %autoawait  %autocall  %automagic  %autosave  %bookmark  %cat  %cd  %clear  %colors  %conda  %config  %connect_info  %cp  %debug  %dhist  %dirs  %doctest_mode  %ed  %edit  %env  %gui  %hist  %history  %killbgscripts  %ldir  %less  %lf  %lk  %ll  %load  %load_ext  %loadpy  %logoff  %logon  %logstart  %logstate  %logstop  %ls  %lsmagic  %lx  %macro  %magic  %man  %matplotlib  %mkdir  %more  %mv  %notebook  %page  %pastebin  %pdb  %pdef  %pdoc  %pfile  %pinfo  %pinfo2  %pip  %popd  %pprint  %precision  %prun  %psearch  %psource  %pushd  %pwd  %pycat  %pylab  %qtconsole  %quickref  %recall  %rehashx  %reload_ext  %rep  %rerun  %reset  %reset_selective  %rm  %rmdir  %run  %save  %sc  %set_env  %store  %sx  %system  %tb  %time  %timeit  %unalias  %unload_ext  %who  %who_ls  %whos  %xdel  %xmode

Available cell magics:
%%!  %%HTML  %%SVG  %%bash  %%capture  %%debug  %%file  %%html  %%javascript  %%js  %%latex  %%markdown  %%perl  %%prun  %%pypy  %%

## Shell Commands 

*from Python Data Science Handbook, by Jake VanderPlas* 

Notebooks give you a syntax for executing shell commands directly from within the notebook. The magic happens with the exclamation point: anything appearing after `!` on a line will be executed not by the Python kernel, but by the system command-line.

The following assumes you're on a Unix-like system, such as Linux or macOS, or in a Colab notebook. 

### Quick Review: Introduction to the Shell 

The shell is a way to interact textually with your computer. 

Someone unfamiliar with the shell might ask why you would bother with this, when many results can be accomplished by simply clicking on icons and menus. A shell user might reply with another question: why hunt icons and click menus when you can accomplish things much more easily by typing? While it might sound like a typical tech preference impasse, when moving beyond basic tasks it quickly becomes clear that the shell offers much more control of advanced tasks, though admittedly the learning curve can intimidate the average computer user.

As an example, here is a sample of a Linux/OSX shell session where a user explores, creates, and modifies directories and files on their system (`osx:~ $` is the prompt, and everything after the $ sign is the typed command; text that is preceded by a `#` is meant just as description, rather than something you would actually type in):

```bash
osx:~ $ echo "hello world"             # echo is like Python's print function
hello world

osx:~ $ pwd                            # pwd = print working directory
/Users/lebrown                             # this is the "path" that we're sitting in

osx:~ $ ls                             # ls = list working directory contents
Applications       Movies  
Desktop            Music
Documents          Pictures
Downloads          Public
Library            Work

osx:~ $ cd Work/                       # cd = change directory

osx:Work $ pwd
/Users/lebrown/Work

osx:Work $ ls
Courses            Research         Projects
temp.txt

osx:Work $ mkdir myproject             # mkdir = make new directory

osx:Work $ cd myproject/

osx:myproject $ mv ../temp.txt ./      # mv = move file. Here we're moving the
                                       # file temp.txt from one directory
                                       # up (../) to the current directory (./)
osx:myproject $ ls
temp.txt
```

This is just a compact way to do familiar operations (navigating a directory structure, creating a directory, moving a file, etc.) by typing commands rather than clicking icons and menus.
Note that with just a few commands (``pwd``, ``ls``, ``cd``, ``mkdir``, and ``cp``) you can do many of the most common file operations.
It's when you go beyond these basics that the shell approach becomes really powerful.


The same operations are available in a Linux terminal, e.g., on a Linux lab machine or on one of the campus Linux servers, `guardian.it.mtu.edu` or `colossus.it.mtu.edu`.

```bash 
ssh guardian.it.mtu.edu 

[lebrown@guardian ~]$ ls
'$RECYCLE.BIN'
ANACONDA3
anaconda3-linux 
...
```


### Shell Commands in Notebooks 

Any command that works at the command line can be used in a notebook by prefixing it with the ``!`` character.
For example, the ``ls``, ``pwd``, and ``echo`` commands can be run as follows:

```ipython
In [1]: !ls

In [2]: !pwd

In [3]: !echo "printing from the shell"
printing from the shell
```

In [10]:
!ls 

class                           myscript.py
data                            nb.week5.instructor.ipynb
egrep.py                        nb.week5.part2.instructor.ipynb
line_count.py                   week5.files.zip
most_common_words.py


In [11]:
!pwd

/Users/lebrown/Dropbox/2024c_fall/un5550-f24/un5550-fa24-private/lec/lec.week5


In [12]:
!echo "printing from the shell"

printing from the shell
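
For comparison, plain Python can run the same shell command through the standard-library `subprocess` module, which is useful in `.py` scripts where the `!` syntax is not available. This is a minimal sketch that assumes a Unix-like system with `echo` on the path; it is not a description of how IPython implements `!` internally.

```python
import subprocess

# Run `echo` as a child process and capture what it writes to stdout.
result = subprocess.run(
    ["echo", "printing from the shell"],
    capture_output=True,  # collect stdout/stderr instead of printing them
    text=True,            # decode bytes to str
    check=True,           # raise CalledProcessError on a non-zero exit code
)

print(result.stdout, end="")
```

Passing the command as a list avoids invoking a shell at all, which sidesteps quoting problems when arguments contain spaces.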


We will review more Linux basics in a few weeks, but here are some references for Linux commands: 

* [37 commands you should know](https://www.howtogeek.com/412055/37-important-linux-commands-you-should-know/)
* [Software Carpentry Foundation - Shell Tutorial](http://swcarpentry.github.io/shell-novice/)

### Shell-Related Magic Commands 


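Shell commands run with ``!`` execute in a temporary subshell, so a command such as ``!cd`` does not change the notebook's working directory. To change the working directory in a lasting way, use the ``%cd`` magic command instead. In fact, by default you can even use this magic command without the ``%`` sign.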

```ipython
In [15]: cd ..
```

This is known as an ``automagic`` function, and this behavior can be toggled with the ``%automagic`` magic function.

Besides ``%cd``, other available shell-like magic functions are ``%cat``, ``%cp``, ``%env``, ``%ls``, ``%man``, ``%mkdir``, ``%more``, ``%mv``, ``%pwd``, ``%rm``, and ``%rmdir``, any of which can be used without the ``%`` sign if ``automagic`` is on.
This makes it so that you can almost treat the IPython prompt as if it's a normal shell:

```ipython
In [16]: mkdir tmp

In [17]: ls
egrep.py  line_count.py  most_common_words.py  myscript.py  sample_data/  tmp/

In [18]: cp egrep.py tmp/

In [19]: ls tmp
egrep.py

In [20]: rm -r tmp
```

This access to the shell from within the same terminal window as your Python session means that there is a lot less switching back and forth between interpreter and shell as you write your Python code.

In [13]:
%cd data

/Users/lebrown/Dropbox/2024c_fall/un5550-f24/un5550-fa24-private/lec/lec.week5/data


In [14]:
!pwd

/Users/lebrown/Dropbox/2024c_fall/un5550-f24/un5550-fa24-private/lec/lec.week5/data


In [15]:
cd ..

/Users/lebrown/Dropbox/2024c_fall/un5550-f24/un5550-fa24-private/lec/lec.week5


In [16]:
mkdir tmp

In [17]:
ls

class/                           myscript.py
data/                            nb.week5.instructor.ipynb
egrep.py                         nb.week5.part2.instructor.ipynb
line_count.py                    tmp/
most_common_words.py             week5.files.zip


In [18]:
cp egrep.py tmp/

In [19]:
ls tmp

egrep.py


In [20]:
rm -r tmp

In [21]:
pwd

'/Users/lebrown/Dropbox/2024c_fall/un5550-f24/un5550-fa24-private/lec/lec.week5'

# Getting Data 

## Use of `stdin` and `stdout`

*Data Science from Scratch* has an example using a Python script, `egrep.py`, that reads in lines of text and writes back out the lines that match a regular expression passed as an argument to the script.

In [28]:
# In a notebook, how can we view the python script? 
#  Hint: On Linux/Mac machines with Jupyter notebook, 
#  we can use linux commands!
# Note, more expects a signal at the end to exit the command

!more egrep.py 

# egrep.py
import sys, re

# sys.argv is the list of command-line arguments
# sys.argv[0] is the name of the program itself
# sys.argv[1] will be the regex specified at the command line
regex = sys.argv[1]

# for every line passed into the script
for line in sys.stdin:
    # if it matches the regex, write it to stdout
    if re.search(regex, line):
        sys.stdout.write(line)

In [29]:
more egrep.py

# egrep.py
import sys, re

# sys.argv is the list of command-line arguments
# sys.argv[0] is the name of the program itself
# sys.argv[1] will be the regex specified at the command line
regex = sys.argv[1]

# for every line passed into the script
for line in sys.stdin:
    # if it matches the regex, write it to stdout
    if re.search(regex, line):
        sys.s

In [30]:
!ls

class                           myscript.py
data                            nb.week5.instructor.ipynb
egrep.py                        nb.week5.part2.instructor.ipynb
line_count.py                   week5.files.zip
most_common_words.py


### Example 1 

Let's try using the `egrep.py` script on a data file `nfl-passing-2018.csv`.  
The data comes from: [https://www.pro-football-reference.com/years/2018/passing.htm](https://www.pro-football-reference.com/years/2018/passing.htm)

In the book, the example at the command line is: 

```bash
cat <somefile.txt> | python egrep.py "[0-9]"
```



First, let's look at the command before the `|`.  

`cat` is a standard Unix utility that reads files sequentially, writing them to standard output.

In [31]:
!cat data/nfl-passing-2018.csv

Rk,Player,Tm,Age,Pos,G,GS,QBrec,Cmp,Att,Cmp%,Yds,TD,TD%,Int,Int%,Lng,Y/A,AY/A,Y/C,Y/G,Rate,QBR,Sk,Yds,NY/A,ANY/A,Sk%,4QC,GWD
1,Ben Roethlisberger\RoetBe00,PIT,36,QB,16,16,9-6-1,452,675,67.0,5129,34,5.0,16,2.4,97,7.6,7.5,11.3,320.6,96.5,71.0,24,166,7.10,7.04,3.4,2,3
2,Andrew Luck*\LuckAn00,IND,29,QB,16,16,10-6-0,430,639,67.3,4593,39,6.1,15,2.3,68,7.2,7.4,10.7,287.1,98.7,69.4,18,134,6.79,6.95,2.7,3,3
3,Matt Ryan\RyanMa00,ATL,33,QB,16,16,7-9-0,422,608,69.4,4924,35,5.8,7,1.2,75,8.1,8.7,11.7,307.8,108.1,68.5,42,296,7.12,7.71,6.5,1,1
4,Kirk Cousins\CousKi00,MIN,30,QB,16,16,8-7-1,425,606,70.1,4298,30,5.0,10,1.7,75,7.1,7.3,10.1,268.6,99.7,58.2,40,262,6.25,6.48,6.2,1,0
5,Aaron Rodgers*\RodgAa00,GNB,35,QB,16,16,6-9-1,372,597,62.3,4442,25,4.2,2,0.3,75,7.4,8.1,11.9,277.6,97.6,54.4,49,353,6.33,6.96,7.6,3,3
6,Case Keenum\KeenCa00,DEN,30,QB,16,16,6-10-0,365,586,62.3,3890,18,3.1,15,2.6,64,6.6,6.1,10.7,243.1,81.2,45.5,34,235,5.90,5.39,5.5,3,4
7,Patrick Mahomes*+\MahoPa00,KAN,23,QB,16,16,12-4-0,383,580,

The `|` (*pipe*) operator sends the standard output of one command to the standard input of the next command. 

In this example the contents of the `nfl` data file are passed as input to the Python egrep.py script. 
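
The filtering that `egrep.py` applies to standard input can also be sketched as an ordinary function over any iterable of lines, which makes the logic easy to try inside a notebook. The function name `grep_lines` and the sample rows are made up for illustration.

```python
import re

def grep_lines(pattern, lines):
    """Yield each line that contains a match for the regex `pattern`."""
    regex = re.compile(pattern)
    for line in lines:
        if regex.search(line):
            yield line

# Hypothetical sample rows, loosely shaped like the NFL file.
sample = [
    "Rk,Player,Tm\n",
    "1,Ben Roethlisberger,PIT\n",
    "2,Andrew Luck,IND\n",
]

# Keep only the rows that contain at least one digit.
digit_lines = list(grep_lines("[0-9]", sample))
print(digit_lines)
```

Because the function is a generator, it handles one line at a time, much as the script does when reading from a pipe.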

In [32]:
# Let's run the command above in a jupyter notebook 
#  !cat data/nfl-passing-2018.csv | python egrep.py "[0-9]"
!cat data/nfl-passing-2018.csv | python egrep.py "[0-9]"

Rk,Player,Tm,Age,Pos,G,GS,QBrec,Cmp,Att,Cmp%,Yds,TD,TD%,Int,Int%,Lng,Y/A,AY/A,Y/C,Y/G,Rate,QBR,Sk,Yds,NY/A,ANY/A,Sk%,4QC,GWD
1,Ben Roethlisberger\RoetBe00,PIT,36,QB,16,16,9-6-1,452,675,67.0,5129,34,5.0,16,2.4,97,7.6,7.5,11.3,320.6,96.5,71.0,24,166,7.10,7.04,3.4,2,3
2,Andrew Luck*\LuckAn00,IND,29,QB,16,16,10-6-0,430,639,67.3,4593,39,6.1,15,2.3,68,7.2,7.4,10.7,287.1,98.7,69.4,18,134,6.79,6.95,2.7,3,3
3,Matt Ryan\RyanMa00,ATL,33,QB,16,16,7-9-0,422,608,69.4,4924,35,5.8,7,1.2,75,8.1,8.7,11.7,307.8,108.1,68.5,42,296,7.12,7.71,6.5,1,1
4,Kirk Cousins\CousKi00,MIN,30,QB,16,16,8-7-1,425,606,70.1,4298,30,5.0,10,1.7,75,7.1,7.3,10.1,268.6,99.7,58.2,40,262,6.25,6.48,6.2,1,0
5,Aaron Rodgers*\RodgAa00,GNB,35,QB,16,16,6-9-1,372,597,62.3,4442,25,4.2,2,0.3,75,7.4,8.1,11.9,277.6,97.6,54.4,49,353,6.33,6.96,7.6,3,3
6,Case Keenum\KeenCa00,DEN,30,QB,16,16,6-10-0,365,586,62.3,3890,18,3.1,15,2.6,64,6.6,6.1,10.7,243.1,81.2,45.5,34,235,5.90,5.39,5.5,3,4
7,Patrick Mahomes*+\MahoPa00,KAN,23,QB,16,16,12-4-0,383,580,

In [33]:
!cat data/nfl-passing-2018.csv | python egrep.py "[a-c]"

Rk,Player,Tm,Age,Pos,G,GS,QBrec,Cmp,Att,Cmp%,Yds,TD,TD%,Int,Int%,Lng,Y/A,AY/A,Y/C,Y/G,Rate,QBR,Sk,Yds,NY/A,ANY/A,Sk%,4QC,GWD
1,Ben Roethlisberger\RoetBe00,PIT,36,QB,16,16,9-6-1,452,675,67.0,5129,34,5.0,16,2.4,97,7.6,7.5,11.3,320.6,96.5,71.0,24,166,7.10,7.04,3.4,2,3
2,Andrew Luck*\LuckAn00,IND,29,QB,16,16,10-6-0,430,639,67.3,4593,39,6.1,15,2.3,68,7.2,7.4,10.7,287.1,98.7,69.4,18,134,6.79,6.95,2.7,3,3
3,Matt Ryan\RyanMa00,ATL,33,QB,16,16,7-9-0,422,608,69.4,4924,35,5.8,7,1.2,75,8.1,8.7,11.7,307.8,108.1,68.5,42,296,7.12,7.71,6.5,1,1
5,Aaron Rodgers*\RodgAa00,GNB,35,QB,16,16,6-9-1,372,597,62.3,4442,25,4.2,2,0.3,75,7.4,8.1,11.9,277.6,97.6,54.4,49,353,6.33,6.96,7.6,3,3
6,Case Keenum\KeenCa00,DEN,30,QB,16,16,6-10-0,365,586,62.3,3890,18,3.1,15,2.6,64,6.6,6.1,10.7,243.1,81.2,45.5,34,235,5.90,5.39,5.5,3,4
7,Patrick Mahomes*+\MahoPa00,KAN,23,QB,16,16,12-4-0,383,580,66.0,5097,50,8.6,12,2.1,89,8.8,9.6,13.3,318.6,113.8,80.4,26,171,8.13,8.89,4.3,2,2
8,Eli Manning\MannEl00,NYG,37,QB,16,16,5-11-0,380,576

In [34]:
!cat data/nfl-passing-2018.csv | python egrep.py "[q]"

33,Ryan Fitzpatrick\FitzRy00,TAM,36,qb,8,7,2-5-0,164,246,66.7,2366,17,6.9,12,4.9,75,9.6,8.8,14.4,295.8,100.4,62.1,14,76,8.81,8.04,5.4,,
34,Nick Foles\FoleNi00,PHI,29,qb,5,5,4-1-0,141,195,72.3,1413,7,3.6,4,2.1,83,7.2,7.0,10.0,282.6,96.0,67.4,9,47,6.70,6.50,4.4,2,2
35,Brock Osweiler\OsweBr00,MIA,28,qb,7,5,2-3-0,113,178,63.5,1247,6,3.4,4,2.2,75,7.0,6.7,11.0,178.1,86.0,32.3,17,130,5.73,5.42,8.7,1,1
36,Jeff Driskel\DrisJe00,CIN,25,qb,9,5,1-4-0,105,176,59.7,1003,6,3.4,2,1.1,37,5.7,5.9,9.6,111.4,82.2,31.1,16,122,4.59,4.74,8.3,,
37,Lamar Jackson\JackLa00,BAL,21,qb,16,7,6-1-0,99,170,58.2,1201,6,3.5,3,1.8,74,7.1,7.0,12.1,75.1,84.5,46.3,16,71,6.08,5.99,8.6,0,1
38,C.J. Beathard\BeatC.00,SFO,25,qb,6,5,0-5-0,102,169,60.4,1252,8,4.7,7,4.1,82,7.4,6.5,12.3,208.7,81.8,37.3,18,156,5.86,5.03,9.6,,
39,Cody Kessler\KessCo00,JAX,25,qb,5,4,2-2-0,85,131,64.9,709,2,1.5,2,1.5,35,5.4,5.0,8.3,141.8,77.4,26.3,22,149,3.66,3.33,14.4,,
40,Josh McCown\McCoJo01,NYJ,39,qb,4,3,0-3-0,60,110,54.5,539,1,0.9,4,3.6,41,4.9,3.4,

### Example 2 

Another example uses a second Python script, `line_count.py`, which counts the lines it receives on standard input. 

Following the book, pipe the output of the `egrep.py` script into the `line_count.py` script. 

```bash
cat <somefile.txt> | python egrep.py "[0-9]" | python line_count.py
```
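
The contents of `line_count.py` are not shown in this notebook. A minimal sketch of such a script, assuming it simply counts the lines arriving on standard input, could look like the following; the helper name `count_lines` is hypothetical.

```python
import io

def count_lines(stream):
    """Return the number of lines in a file-like object or iterable."""
    return sum(1 for _ in stream)

# Quick check with an in-memory "file" instead of a real pipe.
demo_count = count_lines(io.StringIO("first\nsecond\nthird\n"))
print(demo_count)

# In a standalone script, the input would come from the pipe instead:
#     import sys
#     print(count_lines(sys.stdin))
```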

In [35]:
# Let's consider this same command inside notebook with the nfl data 
!cat data/nfl-passing-2018.csv  | python egrep.py "[0-9]" | python line_count.py

!cat data/nfl-passing-2018.csv  | python egrep.py "[a-zA-Z]" | python line_count.py

!cat data/nfl-passing-2018.csv  | python egrep.py "[a-c]" | python line_count.py

!cat data/nfl-passing-2018.csv | python egrep.py "[q]" | python line_count.py

107
107
88
23


### Example 3 

Consider the same commands with another document, an excerpt of Romeo and Juliet. 


In [37]:
# Display the excerpt 
!cat data/romeo-juliet.txt

Two households, both alike in dignity,
In fair Verona, where we lay our scene,
From ancient grudge break to new mutiny,
Where civil blood makes civil hands unclean.
From forth the fatal loins of these two foes
A pair of star-cross'd lovers take their life;
Whose misadventured piteous overthrows
Do with their death bury their parents' strife.
The fearful passage of their death-mark'd love,
And the continuance of their parents' rage,
Which, but their children's end, nought could remove,
Is now the two hours' traffic of our stage;
The which if you with patient ears attend,
What here shall miss, our toil shall strive to mend.
Gregory, o' my word, we'll not carry coals.
No, for then we should be colliers.
I mean, an we be in choler, we'll draw.
Ay, while you live, draw your neck out o' the collar.
I strike quickly, being moved.
But thou art not quickly moved to strike.
A dog of the house of Montague moves me.
To move is to stir; and to be valiant is to stand:
therefore, if thou art moved, t

In [38]:
# run the command in Jupyter with the text data 
!cat data/romeo-juliet.txt  | python egrep.py "[0-9]" | python line_count.py

!cat data/romeo-juliet.txt  | python egrep.py "[a-zA-Z]" | python line_count.py

!cat data/romeo-juliet.txt  | python egrep.py "[a-c]" | python line_count.py

!cat data/romeo-juliet.txt  | python egrep.py "Verona" | python line_count.py

0
3093
2937
12


In [39]:
# can examine other files as well 
!cat data/dracula.txt  | python egrep.py "Dracula" | python line_count.py
!cat data/dracula.txt  | python egrep.py "dracula" | python line_count.py
!cat data/dracula.txt  | python egrep.py "nosferatu" | python line_count.py

36
0
2


### Example 4 

Another example uses a script `most_common_words.py` that reads in the words and outputs the most common ones. 

```bash
cat <somefile.txt> | python most_common_words.py 10
```

In [40]:
# Let's look at the script 
!cat most_common_words.py

# most_common_words.py
import sys
from collections import Counter
    
# pass in number of words as first argument
try:
	num_words = int(sys.argv[1])
except:
	print ("usage: most_common_words.py num_words")
	sys.exit(1) # non-zero exit code indicates error

counter = Counter(word.lower()                      # lowercase words 
				  for line in sys.stdin             #
                  for word in line.strip().split()  # split on spaces 
                  if word)                          # skip empty 'words'

for word, count in counter.most_common(num_words): 
	sys.stdout.write(str(count)) 
	sys.stdout.write("\t")
	sys.stdout.write(word) 
	sys.stdout.write("\n")

In [41]:
# run the command to list the 25 most common words in Romeo and Juliet 
!cat data/romeo-juliet.txt | python most_common_words.py 25

661	the
637	and
545	i
510	to
441	a
380	of
357	my
330	is
326	that
304	in
265	thou
235	with
220	you
214	not
210	for
196	be
190	this
183	it
174	me
167	but
167	thy
156	as
130	will
128	what
127	his


In [42]:
!cat data/dracula.txt | python most_common_words.py 25

7983	the
5754	and
4504	to
4499	i
3710	of
2933	a
2509	he
2475	in
2365	that
1804	was
1736	it
1561	as
1493	we
1488	for
1456	is
1445	his
1314	not
1306	with
1213	my
1197	you
1082	at
1049	have
1048	all
1043	be
1020	had


In [43]:
!cat data/jane_eyre.txt | python most_common_words.py 25

7864	the
6424	i
6321	and
5121	to
4435	of
4373	a
2716	in
2402	was
2317	you
2146	my
1776	it
1749	he
1560	as
1466	that
1443	her
1440	had
1414	not
1407	with
1321	she
1271	is
1240	for
1184	me
1182	his
1159	at
1088	but


### Example 5 

Look at another text data file, containing a recent US State of the Union Address: `sotu231.txt`

In [44]:
!cat data/sotu231.txt | python most_common_words.py 20

233	the
211	and
152	to
149	of
115	our
102	we
96	a
79	in
62	that
58	for
57	is
57	will
51	have
40	i
34	are
34	with
32	all
31	american
29	on
29	they


## Reading / Writing data from files 

### Example 6 - Writing to a file 

First, let's create a short text file, writing 3 strings to the file. 


In [45]:
with open('data/testfile.txt', 'w') as fw: 
    fw.write("Hello World!\n")
    fw.write("This is a test.\n")
    fw.write("This is only a test.\n")

When you finish writing to a file, be sure to close it with `fw.close()`.  This happens automatically when the writing is done inside a `with` block, as shown above. 

*Note that opening an existing file in `'w'` mode erases its previous contents!*

In [46]:
cat data/testfile.txt

Hello World!
This is a test.
This is only a test.
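
Opening a file in `'w'` mode replaces whatever the file held before, which is easy to check. The sketch below works in a temporary directory so no course files are touched; the file name `demo.txt` is arbitrary. It also shows append mode (`'a'`), which adds to an existing file instead of replacing it.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "demo.txt")

    # 'w' creates the file (or truncates it if it already exists).
    with open(path, "w") as fw:
        fw.write("first version\n")

    # Opening with 'w' again discards the previous contents.
    with open(path, "w") as fw:
        fw.write("second version\n")

    # 'a' keeps the existing contents and writes at the end.
    with open(path, "a") as fa:
        fa.write("appended line\n")

    with open(path, "r") as f:
        contents = f.read()

    # Leaving the with block has already closed the file.
    file_closed = f.closed

print(contents)
```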


### Example 7 - Reading back from the file 

In [47]:
# One option 
f = open('data/testfile.txt', 'r')
f.readlines()

['Hello World!\n', 'This is a test.\n', 'This is only a test.\n']

In [48]:
# Another option 
f = open('data/testfile.txt', 'r')
f.read()

'Hello World!\nThis is a test.\nThis is only a test.\n'

**Q:** What is the difference between the two options? 
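
One way to explore the question is to inspect the type of each return value. This sketch uses an in-memory `io.StringIO` object in place of `data/testfile.txt` so that it runs anywhere; real file objects expose the same two methods.

```python
import io

text = "Hello World!\nThis is a test.\nThis is only a test.\n"

# readlines() returns a list with one string per line.
lines = io.StringIO(text).readlines()
print(type(lines).__name__, len(lines))

# read() returns the entire contents as a single string.
whole = io.StringIO(text).read()
print(type(whole).__name__, len(whole))

# Joining the list of lines gives back the single string.
print("".join(lines) == whole)
```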

### Example 8 - Read line with iterator 



In [49]:
with open('data/testfile.txt', 'r') as f:
    for line in f:
        print(line)

Hello World!

This is a test.

This is only a test.
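
The blank lines in this output appear because each `line` still ends with its own `\n` and `print` adds a second one. Two common ways to avoid the double spacing are sketched below, using an in-memory stand-in for the file.

```python
import io

f = io.StringIO("Hello World!\nThis is a test.\nThis is only a test.\n")

stripped = []
for line in f:
    # Option 1: tell print not to add its own newline.
    print(line, end="")
    # Option 2: remove the trailing newline before using the line.
    stripped.append(line.rstrip("\n"))

print(stripped)
```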



In [50]:
with open('data/romeo-juliet.txt', 'r') as f:
    for line in f:
        print(line)

Two households, both alike in dignity,

In fair Verona, where we lay our scene,

From ancient grudge break to new mutiny,

Where civil blood makes civil hands unclean.

From forth the fatal loins of these two foes

A pair of star-cross'd lovers take their life;

Whose misadventured piteous overthrows

Do with their death bury their parents' strife.

The fearful passage of their death-mark'd love,

And the continuance of their parents' rage,

Which, but their children's end, nought could remove,

Is now the two hours' traffic of our stage;

The which if you with patient ears attend,

What here shall miss, our toil shall strive to mend.

Gregory, o' my word, we'll not carry coals.

No, for then we should be colliers.

I mean, an we be in choler, we'll draw.

Ay, while you live, draw your neck out o' the collar.

I strike quickly, being moved.

But thou art not quickly moved to strike.

A dog of the house of Montague moves me.

To move is to stir; and to be valiant is to stand:

therefore

### Example 8B - Read line, count number of words / line

Let's take the example from above and change it so that it returns the number of words on each line. 

First, how might we split the string into different words? 

In [51]:
test_str = "Hello world, welcome to Python."
test_str.split()

['Hello', 'world,', 'welcome', 'to', 'Python.']

In [52]:
# Using the example above replace the print(line) and have it print the number of words / line
with open('data/testfile.txt', 'r') as f:
    for line in f:
        # print the number of words in line 
        print(len(line.split()))

2
4
5


### Example 8C - Read line, count words, print formatted output

Let's now print out for each line, two items: the line number and the number of words / line formatted as 

    Line i: j

where `i` is the line number and `j` is the number of words/line

In [53]:
with open('data/testfile.txt', 'r') as f:
    i = 1
    for line in f:
        # Use sprintf-like printing in python
        print("Line %d: %d" % (i, len(line.split())))
        i += 1

Line 1: 2
Line 2: 4
Line 3: 5
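
The manual counter can be replaced with `enumerate`, and the `%` formatting with an f-string. This is an equivalent sketch rather than a required style; it reads from an in-memory stand-in so it runs without the data file.

```python
import io

f = io.StringIO("Hello World!\nThis is a test.\nThis is only a test.\n")

report = []
# enumerate(..., start=1) yields (line number, line) pairs beginning at 1.
for i, line in enumerate(f, start=1):
    report.append(f"Line {i}: {len(line.split())}")

print("\n".join(report))
```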


### Example 9 - Read / write to files 

Let's read in a file, count the number of words/line, and create a new file with the lines followed by the number of words/line

In [54]:
with open('data/testfile.txt', 'r') as f:
    with open('data/testfile-wpl.txt', 'w') as fw:
        for line in f:
            # write the stripped line, a separator, and its word count
            fw.write(line.strip() + ' - ' + str(len(line.split())) + '\n')

In [55]:
cat data/testfile-wpl.txt


Hello World! - 2
This is a test. - 4
This is only a test. - 5


## Delimited Files 

Many files that we work with will be delimited: *comma-separated*, *tab-separated*, or split on some other special separator.  In these files, the fields themselves may contain commas, tabs, or newlines, so parsing them yourself can be challenging.  The preferred method is to use Python's `csv` module or the `pandas` library (which we have already seen).  

However, here are a few examples doing this by hand.  We use `csv.reader` to iterate over the rows. 
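
To see why hand-parsing is fragile, consider a field that contains the delimiter inside quotes. The row below is invented for illustration; `str.split` breaks the quoted field apart, while `csv.reader` keeps it intact.

```python
import csv
import io

# A made-up row whose second field contains a comma inside quotes.
raw = '6/20/2014,"Alphabet, Inc.",556.36\n'

# Splitting on every comma ignores the quoting.
naive = raw.strip().split(",")

# csv.reader understands quoted fields.
parsed = next(csv.reader(io.StringIO(raw)))

print(naive)   # four pieces, with stray quote characters
print(parsed)  # three fields
```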

In [56]:
cat data/tab_delimited_stock_prices.txt

6/20/2014	AAPL	90.91
6/20/2014	MSFT	41.68
6/20/2014	FB	64.5
6/19/2014	AAPL	91.86
6/19/2014	MSFT	41.51
6/19/2014	FB	64.34

In [57]:
# import csv 

with open('data/tab_delimited_stock_prices.txt', 'r') as f:
    reader = csv.reader(f, delimiter='\t')
    for r in reader:
        date = r[0]
        symbol = r[1]
        price = float(r[2])
        print("%s : %s %.1f" % (date, symbol, price))

6/20/2014 : AAPL 90.9
6/20/2014 : MSFT 41.7
6/20/2014 : FB 64.5
6/19/2014 : AAPL 91.9
6/19/2014 : MSFT 41.5
6/19/2014 : FB 64.3


### Example 10 - Read in delimited files with `csv` module

Let's adapt the code from above for a colon delimited file. 

In [58]:
cat data/colon_delimited_stock_prices.txt

date:symbol:closing_price
6/20/2014:AAPL:90.91
6/20/2014:MSFT:41.68
6/20/2014:FB:64.5

In [59]:
with open('data/colon_delimited_stock_prices.txt', 'r') as f:
    reader = csv.DictReader(f, delimiter=':')
    for r in reader:
        date = r["date"]
        symbol = r["symbol"]
        price = float(r["closing_price"])
        print("%s : %s %.1f" % (date, symbol, price))

6/20/2014 : AAPL 90.9
6/20/2014 : MSFT 41.7
6/20/2014 : FB 64.5


### Example 11 - Write to csv 



In [60]:
# import time 

today_prices = {'AAPL' : 90.91, 'MSFT' : 41.68, 'FB' : 64.5 }

with open('data/comma_delim_test.txt', 'w', newline='') as f:
    writer = csv.writer(f, delimiter=',')
    for stock, price in today_prices.items():
        writer.writerow([time.strftime("%m/%d/%Y"),stock,price])

In [61]:
cat data/comma_delim_test.txt

09/26/2024,AAPL,90.91
09/26/2024,MSFT,41.68
09/26/2024,FB,64.5
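
When the output should carry a header row, as in the colon-delimited file from Example 10, `csv.DictWriter` is the writing counterpart of `csv.DictReader`. The sketch below writes to an in-memory buffer and uses a fixed date string so the result is reproducible; both choices are for illustration only.

```python
import csv
import io

today_prices = {"AAPL": 90.91, "MSFT": 41.68, "FB": 64.5}

buffer = io.StringIO()
writer = csv.DictWriter(
    buffer,
    fieldnames=["date", "symbol", "closing_price"],
    delimiter=":",
    lineterminator="\n",  # the csv default is "\r\n"
)

# Write the header row, then one row per stock.
writer.writeheader()
for stock, price in today_prices.items():
    writer.writerow(
        {"date": "6/20/2014", "symbol": stock, "closing_price": price}
    )

output = buffer.getvalue()
print(output, end="")
```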


## Reading / Writing Files with `pandas`

`pandas` has several functions to read tabular data (csv, tab-delimited, etc.) as a DataFrame object.

 Function   | Description 
 -----------|-------------
 read_csv   | Load delimited data from a file, URL, or file-like object. Uses a comma as the default delimiter 
 read_table | Load delimited data from a file, URL, or file-like object. Uses a tab ('\t') as the default delimiter 
 read_fwf   | Read data in fixed-width column format (that is, no delimiters)  
 read_clipboard | Version of read_table that reads data from the clipboard. Useful for converting tables from web pages 
 ... | ...
 
 


### Example 12 - Import csv file with `pandas`



In [62]:
cat data/ex1.csv

a,b,c,d,message
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo

In [63]:
# Read in the file: data/ex1.csv  
#   with the read_csv command
df1 = pd.read_csv('data/ex1.csv')
df1

|   | a | b | c | d | message |
|---|---|---|---|---|---------|
| 0 | 1 | 2 | 3 | 4 | hello |
| 1 | 5 | 6 | 7 | 8 | world |
| 2 | 9 | 10 | 11 | 12 | foo |


In [64]:
# Read in the file: data/ex1.csv  
#   with the read_table command
df1 = pd.read_table('data/ex1.csv', sep=',')
df1

|   | a | b | c | d | message |
|---|---|---|---|---|---------|
| 0 | 1 | 2 | 3 | 4 | hello |
| 1 | 5 | 6 | 7 | 8 | world |
| 2 | 9 | 10 | 11 | 12 | foo |
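
As the two cells above suggest, `read_csv` and `read_table` differ only in their default delimiter. A short sketch that checks this, using the contents of `ex1.csv` pasted into a string (so it does not depend on the data directory):

```python
import io

import pandas as pd

# Contents of data/ex1.csv, inlined so the example is self-contained.
text = "a,b,c,d,message\n1,2,3,4,hello\n5,6,7,8,world\n9,10,11,12,foo\n"

from_csv = pd.read_csv(io.StringIO(text))               # comma by default
from_table = pd.read_table(io.StringIO(text), sep=",")  # comma given explicitly

# The same data with tabs instead of commas: read_table needs no sep argument.
tab_text = text.replace(",", "\t")
from_tabs = pd.read_table(io.StringIO(tab_text))

print(from_csv.equals(from_table))
print(from_csv.equals(from_tabs))
```

Both comparisons print `True`: all three calls build the same DataFrame.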


### Example 13 


In [65]:
cat data/ex2.csv

1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo

In [66]:
# Read in the file: data/ex2.csv  
#   with the read_csv command
df2 = pd.read_csv('data/ex2.csv', header=None)
df2

|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 4 | hello |
| 1 | 5 | 6 | 7 | 8 | world |
| 2 | 9 | 10 | 11 | 12 | foo |


In [67]:
# Read in the file: data/ex2.csv  
#   with the read_csv command
#   add header columns of 'a', 'b', 'c', 'd', 'message'
pd.read_csv('data/ex2.csv', names=['a', 'b', 'c', 'd', 'message'])

|   | a | b | c | d | message |
|---|---|---|---|---|---------|
| 0 | 1 | 2 | 3 | 4 | hello |
| 1 | 5 | 6 | 7 | 8 | world |
| 2 | 9 | 10 | 11 | 12 | foo |
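
Once the columns have names, one of them can serve as the row index in place of the default 0, 1, 2. A minimal sketch, assuming we want `message` as the index (the choice of column is only for illustration), with the contents of `ex2.csv` inlined:

```python
import io

import pandas as pd

# Contents of data/ex2.csv (no header row), inlined for a self-contained example.
text = "1,2,3,4,hello\n5,6,7,8,world\n9,10,11,12,foo\n"
names = ["a", "b", "c", "d", "message"]

# names= supplies the header; index_col= picks the column to use as row labels.
df = pd.read_csv(io.StringIO(text), names=names, index_col="message")

print(df)
print(df.loc["world", "b"])
```

Rows can then be selected by label, for example `df.loc["world"]`.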


### Example 14 

In [68]:
cat data/ex3.csv

something,a,b,c,d,message
one,1,2,3,4,NA
two,5,6,,8,world
three,9,10,11,12,foo

In [69]:
# Read in the file: data/ex3.csv  
#   with the read_csv command
#   'NA' and empty fields are treated as missing by default;
#   na_values adds 'NULL' as one more missing-value marker
df3 = pd.read_csv('data/ex3.csv', na_values=['NULL'])
df3

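|   | something | a | b | c | d | message |
|---|-----------|---|---|---|---|---------|
| 0 | one | 1 | 2 | 3.0 | 4 | NaN |
| 1 | two | 5 | 6 | NaN | 8 | world |
| 2 | three | 9 | 10 | 11.0 | 12 | foo |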
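
The sketch below looks more closely at how `read_csv` decides what counts as missing, using the contents of `ex3.csv` inlined as a string. Treating `foo` as a missing marker in the `message` column is purely an illustrative assumption.

```python
import io

import pandas as pd

# Contents of data/ex3.csv, inlined for a self-contained example.
text = (
    "something,a,b,c,d,message\n"
    "one,1,2,3,4,NA\n"
    "two,5,6,,8,world\n"
    "three,9,10,11,12,foo\n"
)

# Default behavior: 'NA' and the empty field both become NaN.
default = pd.read_csv(io.StringIO(text))
missing_counts = default.isna().sum()

# A dict adds extra missing-value markers for specific columns only.
custom = pd.read_csv(io.StringIO(text), na_values={"message": ["foo"]})

# keep_default_na=False turns off the built-in markers, so 'NA' stays a string.
raw = pd.read_csv(io.StringIO(text), keep_default_na=False)

print(missing_counts)
print(custom["message"].isna().sum())
print(repr(raw.loc[0, "message"]))
```

`isna().sum()` is a quick check after any import: it counts the missing values in each column.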


### Example 15

Let's look at a new type of file to read in: a fixed-width file, where each field occupies a fixed range of character positions instead of being separated by a delimiter. 

The format of the file is:

1.  3 digits - Congress Number
2.  5 digits - ICPSR ID Number:  code assigned by the ICPSR as corrected by Howard Rosenthal and the format's original author
3.  2 digits - State Code:  ICPSR State Code
4.  2 digits - Congressional District Number (0 if Senate)
5.  8 chars - State Name
6.  3 digits - Party Code:  100 = Dem., 200 = Repub. (See Party3.dat)
7.  1 digit - ICPSR Occupancy Code:  See any ICPSR Roll Call Voting Codebook
8.  1 digit - ICPSR Office Code:  See any ICPSR Roll Call Voting Codebook
9.  11 chars - Name
10. Votes:  one digit per vote, coded as in the table below

| Code | Meaning |
|------|---------|
| 0 | Not a member of the chamber when this vote was taken |
| 1 | Yea |
| 2 | Paired Yea |
| 3 | Announced Yea |
| 4 | Announced Nay |
| 5 | Paired Nay |
| 6 | Nay |
| 7 | Present (some Congresses) |
| 8 | Present (some Congresses) |
| 9 | Not Voting (Abstention) |

In [70]:
!head data/H114.ord

114107132313MICHIGA 10001CONYERS    7966161161661616111661611116666661616661666166111611166616661119111611666161661616611661116661161116661611961611999999616111666611666111161666911116616616616161616611611611611616666961111116616661616611116666666661611161616166116616161611611166666116616111161699911166161166611661191111619166666169901666116166666661666666616111661161111166661691111616116116616669161161666666116611166611111611111111111111666661111111661166661666611111166111999199966611111611161619999999999999999966111161661611669166616611111116111691661661619616699666166616611161116666616111116666166111616117919611661611166166166661666666966666611661111111666166166661616116616666661616911616161661161666666166616616166161111111666117666661111161111111666111111111161169111169099166611669611169161661116161116666161116611666111666611691116616691166111111161116661661111611166666666611611666616666611166999661611611661111116661116696666616666611116166666161661666166669161666166161616611166616

**Q** What is the best function for this type of data? 

Note that this is an example where you need to understand and keep metadata about your data: what the values represent, and where each column starts and ends.

In [71]:
# Read in the file: data/H114.ord  
#   fixed-width data, so use read_fwf with one width per field
#   (the width of 2 reads the 1-digit occupancy and office codes as a single column)
votes = pd.read_fwf('data/H114.ord', widths=[3,5,2,2,8,3,2,11,1327], header=None)
votes[1:5]

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 114 | 13035 | 13 | 13 | NEW YOR | 100 | 1 | RANGEL | 9900000161661616111661611116666661616661666166... |
| 2 | 114 | 14066 | 81 | 1 | ALASKA | 200 | 1 | YOUNG | 9900000000000000111116166661111116161116191611... |
| 3 | 114 | 14263 | 33 | 8 | MINNESO | 100 | 1 | NOLAN | 9900000000661611116666611116666661616661666166... |
| 4 | 114 | 14657 | 25 | 5 | WISCONS | 200 | 1 | SENSENBR | 7911611616116161111111166661111116161116111611... |
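
The numeric column labels above make the result hard to read. The sketch below shows one way to attach the metadata from the format list to the DataFrame, by giving `read_fwf` a width and a name for each field. The column names are our own labels (an assumption, not part of the file). The sample is the first record from `H114.ord`, truncated to its first six votes so the example stays short.

```python
import io

import pandas as pd

# First record of data/H114.ord, truncated to six votes for illustration.
sample = "114107132313MICHIGA 10001CONYERS    796616\n"

# One (name, width) pair per field, following the format list above.
# The names are hypothetical labels chosen for readability.
layout = [
    ("congress", 3), ("icpsr_id", 5), ("state_code", 2), ("district", 2),
    ("state", 8), ("party", 3), ("occupancy", 1), ("office", 1),
    ("name", 11), ("votes", 6),  # 6 here; the full file uses 1327
]
names = [n for n, _ in layout]
widths = [w for _, w in layout]

# dtype keeps the vote string as text instead of letting pandas guess a number.
df = pd.read_fwf(
    io.StringIO(sample), widths=widths, names=names,
    header=None, dtype={"votes": str},
)

print(df.loc[0, "name"], df.loc[0, "party"], df.loc[0, "votes"])
print("Nay votes:", df.loc[0, "votes"].count("6"))
```

Keeping the layout as a list of (name, width) pairs puts the file's metadata in one place next to the code that depends on it.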


### Example 16 



In [72]:
!head data/hurricanes.txt

Year NamedStorms  Hurricanes  MajorHurricanes  ACE
1851    6   3   1   36
1852    5   5   1   73
1853    8   4   2   76
1854    5   3   1   31
1855    5   4   1   18
1856    6   4   2   49
1857    4   3   0   40
1858    6   6   0   45
1859    8   7   1   56


In [73]:
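# First attempt: treat the file as tab-delimited.
#   The columns are actually separated by spaces, so each row lands in one column.
df = pd.read_table('data/hurricanes.txt', delimiter='\t')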
df.head()

|   | Year NamedStorms Hurricanes MajorHurricanes ACE |
|---|-------------------------------------------------|
| 0 | 1851 6 3 1 36 |
| 1 | 1852 5 5 1 73 |
| 2 | 1853 8 4 2 76 |
| 3 | 1854 5 3 1 31 |
| 4 | 1855 5 4 1 18 |


In [74]:
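# Read in the file from the NumPy assignment: hurricanes.txt 
#   the columns are separated by runs of spaces, so split on the regex \s+
#   (a raw string avoids Python treating \s as an invalid escape sequence)
df = pd.read_table('data/hurricanes.txt', delimiter=r'\s+')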
df

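|   | Year | NamedStorms | Hurricanes | MajorHurricanes | ACE |
|---|------|-------------|------------|-----------------|-----|
| 0 | 1851 | 6 | 3 | 1 | 36 |
| 1 | 1852 | 5 | 5 | 1 | 73 |
| 2 | 1853 | 8 | 4 | 2 | 76 |
| 3 | 1854 | 5 | 3 | 1 | 31 |
| 4 | 1855 | 5 | 4 | 1 | 18 |
| ... | ... | ... | ... | ... | ... |
| 160 | 2011 | 19 | 7 | 4 | 126 |
| 161 | 2012 | 19 | 10 | 2 | 129 |
| 162 | 2013 | 14 | 2 | 0 | 36 |
| 163 | 2014 | 8 | 6 | 2 | 67 |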
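
A wrong delimiter does not raise an error; it silently produces a single wide column. A quick way to catch this is to check the shape of the result. The sketch below uses the first lines of `hurricanes.txt` inlined as a string.

```python
import io

import pandas as pd

# First lines of data/hurricanes.txt, inlined for a self-contained example.
text = (
    "Year NamedStorms  Hurricanes  MajorHurricanes  ACE\n"
    "1851    6   3   1   36\n"
    "1852    5   5   1   73\n"
)

# Tab is the wrong delimiter here: every row becomes one column.
wrong = pd.read_csv(io.StringIO(text), sep="\t")

# The regex \s+ splits on any run of whitespace.
right = pd.read_csv(io.StringIO(text), sep=r"\s+")

print(wrong.shape)
print(right.shape)
print(list(right.columns))
```

Here `wrong.shape` is `(2, 1)` and `right.shape` is `(2, 5)`, so comparing the column count to what you expect flags the problem right away.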


### Example 17 - Writing out with `pandas` 

Let's look at writing data from a Jupyter notebook back to a file (csv). 

In [75]:
df1

|   | a | b | c | d | message |
|---|---|---|---|---|---------|
| 0 | 1 | 2 | 3 | 4 | hello |
| 1 | 5 | 6 | 7 | 8 | world |
| 2 | 9 | 10 | 11 | 12 | foo |


In [76]:
df1.to_csv("data/test.csv")

Look at what gets saved. 

In [77]:
!cat data/test.csv

,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


The row index is written to the file as an unnamed first column. This is not always the behavior we want; passing `index=False` leaves it out. 

In [78]:
df1.to_csv("data/test1.csv", index=False)

In [79]:
cat data/test1.csv

a,b,c,d,message
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
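
The saved index matters when the file is read back in. A minimal round-trip sketch, using a small stand-in DataFrame and in-memory strings instead of files on disk (an assumption made so the example is self-contained):

```python
import io

import pandas as pd

# Small stand-in for df1.
df = pd.DataFrame({"a": [1, 5, 9], "message": ["hello", "world", "foo"]})

# to_csv() with no path returns the CSV text as a string.
with_index_text = df.to_csv()
no_index_text = df.to_csv(index=False)

# Reading the indexed file back gives an extra column named 'Unnamed: 0'.
with_index = pd.read_csv(io.StringIO(with_index_text))

# Either tell read_csv that column 0 is the index, or do not write it at all.
restored = pd.read_csv(io.StringIO(with_index_text), index_col=0)
no_index = pd.read_csv(io.StringIO(no_index_text))

print(list(with_index.columns))
print(list(restored.columns))
print(list(no_index.columns))
```

Only the first read contains the extra `Unnamed: 0` column; the other two recover the original columns.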
