## NYU CDS (Center for Data Science)

# DS-GA 3001: Advanced Python for Data Science

## Lab 01
### 2/2/2021

## Details

### Section 2

Karanbir Singh Chahal <ksc487@nyu.edu> (in-person), Yunxiao Shi <ys3404@nyu.edu> (remote)

**Timing**: Tuesdays from 7:10pm-8:00pm EST

### Section 3

Agnes Sharan Sahaya Raj Helan <asr647@nyu.edu> (remote)

**Timing**: Tuesdays from 8:00pm-8:50pm EST

All lab material can be found in NYU Classes under Resources.

## Required installation

Things you should have installed:
* Shell (bash/zsh, typically included in OS, see [here](https://www.windowscentral.com/how-install-bash-shell-command-line-windows-10) for Windows)
* Python with [Anaconda](https://docs.anaconda.com/anaconda/install/) (or [Miniconda](https://docs.conda.io/projects/conda/en/latest/user-guide/install/)) distribution
* IPython, Jupyter Notebook, numpy, pandas
    `conda install ipython jupyter numpy pandas`
    
### Alternative 

1. Unzip lab1.zip
2. Go to [Google Colab](https://colab.research.google.com/) and upload lab01.ipynb
3. Upload the remaining files to the Files section in the sidebar 

## 1. Shell Script: Producing Reusable Commands

- `flightdelays.csv` - data set containing the arrival and departure details of all commercial flights in the US from 2007 
- Check `flightdelays_with_header.csv` for headers
    1. Column 16 - Departure delay
    2. Column 18 - Destination airport

### e.g 0. Parse a CSV file (process_data.sh)
```text
#!/usr/bin/env bash 
# The shebang line above tells the OS to run this script with bash

echo "Data Processing"
# To store the output of a command as a variable in bash:
# var=$(command)

echo -e "The name of the file is:" $1 "\n"

# Quote "$1" so that file names containing spaces still work
lines=$(wc -l < "$1")
echo -e "The file has" $lines "lines\n"

colnames=$(head -n 1 < "$1")
echo "Column names are: "
echo $colnames
```

In [1]:
#To run:
!bash process_data.sh flightdelays_with_header.csv 

Data Processing
The name of the file is: flightdelays_with_header.csv 

The file has 494 lines

Column names are: 
Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
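
The same two facts, the line count and the header row, can also be read from Python. The following is a minimal sketch; the `sample` text and the `describe` helper are made up for illustration and are not part of the lab files.

```python
import csv
import io

# Hypothetical three-line CSV standing in for flightdelays_with_header.csv
sample = "Year,Month,DepDelay,Dest\n2007,11,-6,ATL\n2007,9,-1,ORD\n"

def describe(handle):
    """Return (number of rows including the header, list of column names)."""
    reader = csv.reader(handle)
    header = next(reader)                 # first row holds the column names
    n_lines = 1 + sum(1 for _ in reader)  # header plus the remaining rows
    return n_lines, header

n_lines, colnames = describe(io.StringIO(sample))
print("The file has", n_lines, "lines")
print("Column names are:", colnames)
```

With a real file, `open(path, newline='')` can be passed in place of the `StringIO` object.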


### e.g 1. Calculate minimum, maximum delay in departure (delay.sh)

```text
#!/usr/bin/env bash 

echo -n "Min delay: "
cut -d ',' -f 16 $1|sort -n|head -1

echo -n "Max delay: "
cut -d ',' -f 16 $1|sort -n|tail -1

```

In [3]:
#To run:
!bash delay.sh flightdelays.csv

Min delay: -14
Max delay: 601
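
For comparison, the `cut | sort -n | head/tail` pipeline can be sketched in Python. This assumes every value in the chosen column is an integer; the `sample` rows and the `delay_range` helper are hypothetical.

```python
import csv
import io

# Hypothetical rows; only the position of the delay column matters here
sample = "a,b,-14\nc,d,601\ne,f,12\n"

def delay_range(handle, column):
    """Return (min, max) of a 1-based integer column, as `cut -f` numbers fields."""
    values = [int(row[column - 1]) for row in csv.reader(handle)]
    return min(values), max(values)

lo, hi = delay_range(io.StringIO(sample), column=3)
print("Min delay:", lo)
print("Max delay:", hi)
```

Unlike the shell version, this reads the file once instead of sorting it twice.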


### e.g 2. Top 3 destination airports (by the number of arriving planes), unique airports (demoscript.sh)
```text
#!/usr/bin/env bash

echo "The top 3 airports:"
cut -d ',' -f 18 $1|sort |uniq -c |sort -n |tail -3
# uniq -c 
# Precede each output line with the count of the number 
# of times the line occurred in the input,
# followed by a single space

echo "The number of unique airports:"
cut -d ',' -f 18 $1|sort |uniq |wc -l
```

In [3]:
#To run:
!bash demoscript.sh flightdelays.csv

The top 3 airports:
  19 PHX
  24 ORD
  37 ATL
The number of unique airports:
     122
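
The `sort | uniq -c | sort -n | tail` idiom corresponds to a frequency count. A rough Python equivalent uses `collections.Counter`; the `dests` list below is made-up data, not the lab file.

```python
from collections import Counter

# Hypothetical destination codes, standing in for column 18 of the CSV
dests = ["ATL", "ORD", "ATL", "PHX", "ATL", "ORD", "JFK", "PHX", "ORD", "ATL"]

counts = Counter(dests)
top3 = counts.most_common(3)   # (airport, count) pairs, most frequent first
n_unique = len(counts)         # number of distinct airports

print("The top 3 airports:", top3)
print("The number of unique airports:", n_unique)
```

Note that `most_common` lists the largest count first, whereas the shell pipeline prints the counts in ascending order.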


### e.g 3. Executing a Python program with arguments (python_shell.sh)
```text
#!/bin/bash
python greeting_arg.py -n $1 -g $2
```

In [5]:
#To run:
!bash python_shell.sh Alice Welcome

Welcome, Alice!
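
The source of `greeting_arg.py` is not shown in this lab. The sketch below is one way such a script could handle the `-n` and `-g` flags with `argparse`; the long option names, the default greeting, and the helper functions are assumptions.

```python
import argparse

def build_parser():
    # Hypothetical command-line interface accepting -n and -g
    parser = argparse.ArgumentParser(description="Print a greeting.")
    parser.add_argument("-n", "--name", required=True, help="who to greet")
    parser.add_argument("-g", "--greeting", default="Hello", help="greeting word")
    return parser

def make_greeting(argv):
    args = build_parser().parse_args(argv)
    return f"{args.greeting}, {args.name}!"

# Same arguments that python_shell.sh forwards as $1 and $2
print(make_greeting(["-n", "Alice", "-g", "Welcome"]))
```

In a standalone script, `parse_args()` would be called without arguments so that it reads `sys.argv`.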


### e.g 4. Executing a program on a set of files (do-stats.sh)
```text
#!/usr/bin/env bash
# $@ refers to all of a shell script's command-line arguments ($1, $2, etc.)
# Place variables in quotes if the values might have spaces in them

for datafile in "$@" 
do
    echo "$datafile"
    bash goostats "$datafile" "stats-$datafile"
done
```

In [4]:
#To run:
!bash do-stats.sh NENE*[AB].txt

NENE01729A.txt
NENE01729B.txt
NENE01736A.txt
NENE01751A.txt
NENE01751B.txt
NENE01812A.txt
NENE01843A.txt
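
The shell expands `NENE*[AB].txt` before the script starts, so `"$@"` receives only the matching file names. The same wildcard rules can be tried in Python with `fnmatch`; this sketch uses a few names from the run above plus two hypothetical names that should not match.

```python
import fnmatch

# Three names from the run above, plus two hypothetical non-matching ones
names = ["NENE01729A.txt", "NENE01729B.txt", "NENE01736A.txt",
         "NENE01751Z.txt", "goostats"]

# * matches any run of characters, [AB] matches exactly one of A or B
matched = fnmatch.filter(names, "NENE*[AB].txt")
for datafile in matched:
    # Output name built the same way as stats-$datafile in do-stats.sh
    print(datafile, "->", f"stats-{datafile}")
```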


## 2. Useful Shell Commands in Scrubbing Data

### Get part of the file: head, sed, tail


In [5]:
!seq -f "Line %g" 10

Line 1
Line 2
Line 3
Line 4
Line 5
Line 6
Line 7
Line 8
Line 9
Line 10


In [11]:
!seq -f "Line %g" 10 | tee lines
# tee: copies standard input to standard output and also writes it to the named file (here, lines)

Line 1
Line 2
Line 3
Line 4
Line 5
Line 6
Line 7
Line 8
Line 9
Line 10


In [12]:
!head -n 3 lines 

Line 1
Line 2
Line 3


In [14]:
!sed '4,10d' lines 
# sed 'm,nd' file: delete lines m through n and print the rest

Line 1
Line 2
Line 3


In [15]:
!tail -n 3 lines

Line 8
Line 9
Line 10
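
For readers more familiar with Python, the three commands map onto list slicing. A small sketch, assuming the file content has been read into a list named `lines`:

```python
lines = [f"Line {i}" for i in range(1, 11)]  # same content as the `lines` file

first_three = lines[:3]                      # head -n 3
last_three = lines[-3:]                      # tail -n 3

m, n = 4, 10                                 # sed line numbers are 1-based and inclusive
without_m_to_n = lines[:m - 1] + lines[n:]   # sed '4,10d'

print(first_three)
print(last_three)
print(without_m_to_n)
```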


### Get part of file using pattern matching (grep)

In [16]:
!grep -i chapter alice.txt # case insensitive

!echo
!grep -E '^CHAPTER .* The' alice.txt 
# regular expression

CHAPTER I Down the Rabbit-Hole
CHAPTER II The Pool of Tears
CHAPTER III A Caucus-Race and a Long Tale
CHAPTER IV. The Rabbit Sends in a Little Bill
CHAPTER V Advice from a Caterpillar
CHAPTER VI Pig and Pepper
CHAPTER VII A Mad Tea-Party
CHAPTER VIII The Queen's Croquet-Ground
CHAPTER IX The Mock Turtle's Story
CHAPTER X The Lobster Quadrille
CHAPTER XI Who Stole the Tarts?
CHAPTER XII Alice's Evidence

CHAPTER II The Pool of Tears
CHAPTER IV. The Rabbit Sends in a Little Bill
CHAPTER VIII The Queen's Croquet-Ground
CHAPTER IX The Mock Turtle's Story
CHAPTER X The Lobster Quadrille
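
The two `grep` calls can be mirrored with Python's `re` module. This sketch runs on three of the headings printed above plus one hypothetical body line, rather than on `alice.txt` itself.

```python
import re

# Three headings from the output above, plus one hypothetical body line
text = [
    "CHAPTER I Down the Rabbit-Hole",
    "CHAPTER II The Pool of Tears",
    "CHAPTER IX The Mock Turtle's Story",
    "in the next chapter she fell asleep",
]

# grep -i chapter: case-insensitive match anywhere in the line
any_case = [line for line in text if re.search("chapter", line, re.IGNORECASE)]

# grep -E '^CHAPTER .* The': anchored at the line start, case-sensitive
anchored = [line for line in text if re.search(r"^CHAPTER .* The", line)]

print(len(any_case), len(anchored))
```

The first heading is excluded by the second pattern because its "the" is lowercase.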


### Replacing and Deleting Values: tr

In [13]:
!echo 'hello world!' | tr -d " "

!echo 'hello world!' | tr " " '_'

!echo 'hello world!' | tr ' !' '_?'

!echo 'hello world!' | tr '[a-z]' '[A-Z]'

'helloworld!'
'hello_world!'_
'hello_world?'_
'HELLO WORLD!' 
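
The same character-level edits can be expressed with Python string methods. A short sketch of one possible mapping for each `tr` call:

```python
s = "hello world!"

deleted = s.replace(" ", "")                     # tr -d " "
underscored = s.replace(" ", "_")                # tr " " '_'
mapped = s.translate(str.maketrans(" !", "_?"))  # tr ' !' '_?'
upper = s.upper()                                # tr '[a-z]' '[A-Z]'

print(deleted, underscored, mapped, upper, sep="\n")
```

`str.maketrans` pairs characters by position, the same way `tr` pairs its two sets.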


## 3. Numpy and Pandas Review 

In [14]:
import numpy as np
import pandas as pd

In [16]:
a = np.array([[1,2,3],[4,5,6]])
# a tuple of integers indicating the shape of the array in each dimension
print('Shape of the array:', a.shape)
# the total number of elements of the array
print('Total number of elements:', a.size)
# an object describing the type of the elements in the array
print('Dtype:', a.dtype)
# the size in bytes of each element of the array
print('Size in bytes:', a.itemsize)
# buffer object pointing to the start of the array's data
print('Buffer:', a.data)

Shape of the array: (2, 3)
Total number of elements: 6
Dtype: int32
Size in bytes: 4
Buffer: <memory at 0x0000029C13DD2BA0>
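
The attributes above are related: the total memory used by the elements is `size * itemsize`. In the sketch below the dtype is set explicitly, since the default integer dtype depends on the platform and NumPy version.

```python
import numpy as np

# dtype given explicitly so the result does not depend on the platform default
a = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int32)

print(a.ndim)      # number of dimensions, i.e. len(a.shape)
print(a.itemsize)  # bytes per element: 4 for int32
print(a.nbytes)    # total bytes: a.size * a.itemsize
```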


In [19]:
## Example of 3-dimensional array in numpy
b = np.array([[[1,2,3]],[[4,5,6]],[[7,8,9]]])
print(b.shape)
print(b.size)
## Note that the shape of the array has 3 elements instead of 2

(3, 1, 3)
9


In [21]:
print(np.ones((3, 3))) # Create an array of all ones
print(np.zeros((3, 3))) # Create an array of all zeros
print(np.full((3,3), 7)) # Create a constant array
print(np.random.rand(3, 3)) # Create an array filled with random values

[[1. 1. 1.]
 [1. 1. 1.]
 [1. 1. 1.]]
[[0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]]
[[7 7 7]
 [7 7 7]
 [7 7 7]]
[[0.95094168 0.27169011 0.57100091]
 [0.7045408  0.98139342 0.52845482]
 [0.59204091 0.33076546 0.34123077]]
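
The printed `1.` versus `7` reflects different dtypes: `np.ones` and `np.zeros` default to floats, while `np.full` infers the dtype from the fill value. The sketch below checks this and also shows a seeded generator, which is an addition for illustration and not part of the lab cell.

```python
import numpy as np

ones = np.ones((3, 3))                   # float64 by default
ints = np.ones((3, 3), dtype=np.int64)   # request integers explicitly
sevens = np.full((3, 3), 7)              # dtype inferred from the fill value

# Two generators with the same seed produce the same "random" array
rng_a = np.random.default_rng(0)
rng_b = np.random.default_rng(0)
same = np.array_equal(rng_a.random((3, 3)), rng_b.random((3, 3)))

print(ones.dtype, ints.dtype, sevens.dtype, same)
```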


In [22]:
#Array indexing:
a = np.array([[1, 2, 3], [4, 5, 6]])
print(a[1:2, 1:],'\n')
print(a[0, 1])

[[5 6]] 

2
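
A slice keeps the dimension it is applied to, while an integer index removes it. The sketch below contrasts the two and also shows that basic slices are views on the original data.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])

row_slice = a[1:2, 1:]   # slice on both axes: still 2-D, shape (1, 2)
row_index = a[1, 1:]     # integer on the first axis: 1-D, shape (2,)

# Basic slices are views: writing through one changes the original array
row_index[0] = 50

print(row_slice.shape, row_index.shape)
print(a)
```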


In [26]:
x = np.arange(4).reshape(2, 2)
print(x)
print(np.transpose(x)) #transposes matrix

[[0 1]
 [2 3]]
[[0 2]
 [1 3]]
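
With a non-square array the effect of transposing is easier to see, because the shape itself changes. A brief sketch using the `.T` shorthand:

```python
import numpy as np

x = np.arange(6).reshape(2, 3)   # [[0 1 2], [3 4 5]]
xt = x.T                         # shorthand for np.transpose(x)

print(xt.shape)                  # axes reversed: (3, 2)
print(xt[2, 0] == x[0, 2])       # element (i, j) moves to (j, i)
```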


In [23]:
# Assignment does not copy: b and a refer to the same array
a = np.arange(12)
b = a
a[1] = 5
print('b:',b)
# np.copy creates an independent copy of the array
c = np.copy(b)
b[1] = -2
print('b:',b)
print('c:',c)
print(c.data)

b: [ 0  5  2  3  4  5  6  7  8  9 10 11]
b: [ 0 -2  2  3  4  5  6  7  8  9 10 11]
c: [ 0  5  2  3  4  5  6  7  8  9 10 11]
<memory at 0x0000029C1790BAC0>
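
Whether two names refer to the same data can be checked directly rather than inferred from printed values. A minimal sketch using `is` and `np.shares_memory`:

```python
import numpy as np

a = np.arange(12)
b = a            # second name for the same array object
c = np.copy(a)   # independent copy with its own data
v = a[:4]        # a slice is a new object, but a view on the same data

print(b is a)                    # True
print(np.shares_memory(a, c))    # False
print(np.shares_memory(a, v))    # True
```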


In [27]:
index = ['a','b','c','d','e']
series = pd.Series(np.random.randint(0,10,5), index=index) 
# One-dimensional ndarray with axis labels 
# (including time series)
print(series)

a    5
b    9
c    9
d    1
e    6
dtype: int32


In [30]:
print(series[['a']], '\n') # a list of labels returns a Series
print(series[['a', 'c']],'\n') # select several labels at once
# Slicing by label includes the end label
print(series['b':'e'])

a    5
dtype: int32 

a    5
c    9
dtype: int32 

b    9
c    9
d    1
e    6
dtype: int32
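
Label slices include the end label, unlike positional slices. The sketch below makes the difference explicit with `.loc` and `.iloc`, using fixed values copied from the printed series so the result is reproducible.

```python
import pandas as pd

# Fixed values taken from the printed series above
series = pd.Series([5, 9, 9, 1, 6], index=["a", "b", "c", "d", "e"])

by_label = series.loc["b":"d"]     # label slice: includes the end label "d"
by_position = series.iloc[1:3]     # position slice: excludes position 3

print(list(by_label.index))        # ['b', 'c', 'd']
print(list(by_position.index))     # ['b', 'c']
```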


In [31]:
data = [['tom', 10], ['nick', 15], ['juli', 14]] 
df = pd.DataFrame(data, columns = ['Name', 'Age'])
print(df)

   Name  Age
0   tom   10
1  nick   15
2  juli   14
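
Once the data is in a DataFrame, columns and rows can be selected by name and by condition. A short sketch on the same three rows; the age threshold of 12 is arbitrary.

```python
import pandas as pd

data = [['tom', 10], ['nick', 15], ['juli', 14]]
df = pd.DataFrame(data, columns=['Name', 'Age'])

ages = df['Age']                   # a single column is a Series
older = df[df['Age'] > 12]         # boolean mask keeps the matching rows
names = older['Name'].tolist()

print(ages.mean())
print(names)
```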


In [37]:
df = pd.read_csv('flightdelays_with_header.csv')
print(df.head())
# Alternative: arr = np.array(df)
arr = df.to_numpy()
print(arr.shape)
print(arr[0])

   Year  Month  DayofMonth  DayOfWeek  DepTime  CRSDepTime  ArrTime  \
0  2007     11           2          5     1534        1540   1654.0   
1  2007      9          11          2     1822        1823   2258.0   
2  2007      9          24          1      711         715    830.0   
3  2007     11           5          1     2243        2250    549.0   
4  2007      7           8          7     1656        1700   1832.0   

   CRSArrTime UniqueCarrier  FlightNum  ... TaxiIn  TaxiOut  Cancelled  \
0        1710            XE       7770  ...     12       13          0   
1        2257            US        596  ...      7       22          0   
2         844            US       1243  ...      3       10          0   
3         556            NW        222  ...      6       19          0   
4        1840            XE       2607  ...      7       16          0   

   CancellationCode  Diverted  CarrierDelay WeatherDelay NASDelay  \
0               NaN         0             0            0   

In [29]:
!python test.py

0.5175251960754395
0.0049343109130859375


### References:
[Introduction to shell script](https://data36.com/command-line-data-science-introduction-to-bash/)

[Shebang for shell](https://scriptingosx.com/2017/10/on-the-shebang/)

[More about scrubbing data on the shell](https://www.datascienceatthecommandline.com/chapter-5-scrubbing-data.html)

[Bash for pipelines](https://towardsdatascience.com/using-bash-for-data-pipelines-cf05af6ded6f)

[More about scripts](https://www.macs.hw.ac.uk/~hwloidl/Courses/LinuxIntro/x961.html)

[Regular expression](https://en.wikipedia.org/wiki/Regular_expression)

[Numpy tutorial](https://docs.scipy.org/doc/numpy/user/quickstart.html)