## NYU CDS (Center for Data Science)

# DS-GA 3001: Advanced Python for Data Science

## Lecture 01
### 2/1/2021

# Course Details

**Instructor**: Alberto Bietti <alberto.bietti@nyu.edu>

**TAs**: Karanbir Singh Chahal <ksc487@nyu.edu>, Yunxiao Shi <ys3404@nyu.edu>, Agnes Sharan Sahaya Raj Helan <asr647@nyu.edu>

**Lectures**: Mondays 4:55pm-6:35pm, Room GCASL 566, 238 Thompson St

**Labs**:
* 002 (in-person with Karan) Tuesdays from 7:10pm-8:00pm, Room 60FA 110, 60 5th Avenue
* 002 (remote with Yunxiao) Tuesdays from 7:10pm-8:00pm on Zoom
* 003 (online with Agnes) Tuesdays from 8:00pm-8:50pm

# Materials/links

* All materials and announcements will be posted on NYU Classes <http://newclasses.nyu.edu>
* Ed Stem discussion board: https://edstem.org/us/join/GhtGGZ
* Additional material from older versions of the course available here:
    * https://nyu-cds.github.io/courses/advanced/ (including grading information)


* Optional textbooks for additional information
    * Introduction to Parallel Computing, Ananth Grama, Anshul Gupta, George Karypis, Vipin Kumar, Pearson; 2nd edition (January 26, 2003), ISBN 978-0201648652
    * Big Data: Principles and best practices of scalable realtime data systems, 1st Edition, Nathan Marz, James Warren, ISBN 978-1617290343

# Objectives of this course

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/f/f8/Python_logo_and_wordmark.svg/1024px-Python_logo_and_wordmark.svg.png" width="200">
<!-- <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/3/31/NumPy_logo_2020.svg/220px-NumPy_logo_2020.svg.png" width="200"> -->

* Python + NumPy/SciPy are great for data science and scientific computing
    * Data analysis, machine learning, big data processing, computational physics, numerical optimization, etc.


* Python can be slow! Sometimes more is needed for better performance
    * Performance optimization, low-level code
    * Specialized hardware (GPUs)
    * Parallelism, concurrency
    * Distributed computing

* Hands-on! Learn by doing
* "Learn how to learn"

# Survey

* Undergrad/Master's/PhD? CDS vs other?
* Python knowledge?
* Shell?
* C/C++?
* Mac/Windows/Linux?

# Installation

Things you should install:
* Shell (bash/zsh, typically included in OS, see [here](https://www.windowscentral.com/how-install-bash-shell-command-line-windows-10) for Windows)
* Python with [Anaconda](https://docs.anaconda.com/anaconda/install/) (or [Miniconda](https://docs.conda.io/projects/conda/en/latest/user-guide/install/)) distribution
* IPython, Jupyter Notebook, NumPy: `conda install ipython jupyter numpy`

# Shell

Material for this lecture from https://swcarpentry.github.io/shell-novice/

# What is "the Shell"?
([source](https://linuxcommand.org/lc3_lts0010.php))

"Simply put, the shell is a program that takes commands from the keyboard and gives them to the operating system to perform. In the old days, it was the only user interface available on a Unix-like system such as Linux. Nowadays, we have *graphical user interfaces (GUIs)* in addition to *command line interfaces (CLIs)* such as the shell.

On most Linux systems a program called **bash** (which stands for Bourne Again SHell, an enhanced version of the original Unix shell program, sh, written by Steve Bourne) acts as the shell program. Besides **bash**, there are other shell programs available for Linux systems. These include: **zsh**, **ksh** and **tcsh**."

# Navigating Files and Directories

* `pwd`: print working directory
* `ls`: list files and directories in current path
    * `-a` (show hidden) `-l` (more details) `-t` (time order)
* `cd`: change directory
* `mkdir`: create directory
* `rm -r`: delete directory (recursively)

Getting help: `<cmd> --help` or `man <cmd>`
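
A short session tying these commands together. This is only a sketch: it works inside a throwaway directory created with `mktemp -d`, and the directory names `project` and `data` are made up for illustration.

```shell
# Work in a throwaway directory so nothing real is touched
workdir=$(mktemp -d)
cd "$workdir"

pwd                  # print the current working directory
mkdir project        # create a directory
mkdir project/data   # create a directory inside it
cd project           # move into the new directory
ls -a                # list everything, including the hidden entries . and ..
cd ..                # go back up one level
rm -r project        # delete the directory and everything inside it
ls                   # prints nothing: the directory is empty again
```

Note that `rm -r` does not ask for confirmation and there is no trash bin, so double-check the path before running it.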

# Files

* `touch file.txt`: create empty file
* `nano file.txt` or `vim file.txt`: edit file with a text editor
* `cat file.txt`, `less file.txt`: view content of a text file
<br/><br/>
* `cp file_src file_dst` or `cp file_src dir_dst`: copy file to another file or directory
* `mv file dir`: move file to directory (`mv old_name new_name` renames a file)
* `rm file.txt`: delete the file

Note: avoid spaces in filenames, stick to letters, numbers, `. - _`
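
The file commands above can be tried out safely in a scratch directory. A minimal sketch, assuming hypothetical file names (`notes.txt`, `archive`, and so on) that are not part of the course material:

```shell
# Work in a throwaway directory so nothing real is touched
workdir=$(mktemp -d)
cd "$workdir"

touch notes.txt               # create an empty file
cp notes.txt notes_copy.txt   # copy to a new file
mkdir archive                 # create a directory
cp notes.txt archive          # copy into a directory (the name is kept)
mv notes_copy.txt archive     # move a file into a directory
mv notes.txt todo.txt         # with a file name as destination, mv renames
rm archive/notes_copy.txt     # delete a single file
ls . archive                  # show what is left in both directories
```

At the end, the scratch directory holds `todo.txt` and `archive`, and `archive` holds only `notes.txt`.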

# Pipes, redirection, filters

* `cat`/`head`/`tail`: show input contents
* `sort`: sort input lines
* `uniq`: remove adjacent duplicate lines (so `sort` the input first)
* `cut`: extract fields from file
* `wc`: count lines, words, characters
<br/><br/>
* `cmd > file`: redirect output to file
* `cmd >> file`: append output to file
* `cmd < input_file`: read input from file
* `cmd1 | cmd2`: pipe output of cmd1 as input of cmd2
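
To see how these pieces combine, here is a small sketch that builds a sample file with redirection and then chains filters with pipes. The file names and the animal list are invented for the example.

```shell
# Work in a throwaway directory so nothing real is touched
workdir=$(mktemp -d)
cd "$workdir"

# Build a small sample file
echo "bear"  > animals.txt    # > creates (or overwrites) the file
echo "deer" >> animals.txt    # >> appends to it
echo "bear" >> animals.txt
echo "fox"  >> animals.txt

# Sort the lines, drop duplicates, count what is left (3 distinct names)
sort animals.txt | uniq | wc -l

# Same filters, reading via < and saving the result with >
sort < animals.txt | uniq > unique.txt
cat unique.txt
```

Each command in a pipeline only sees the text coming out of the previous one, which is why `sort` has to come before `uniq`.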


# Wildcards and regular expressions

* `*` matches any sequence of characters (including none)
    * e.g. `A*`, `*.xml`, `a*.txt`
* `[abc]` matches a single character in the list, `[a-f]` matches a single letter between a and f
* `{ab,bcd}` expands to each of the options (brace expansion, available in bash/zsh)
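
A safe way to check what a pattern matches is to pass it to `echo`, since the shell expands the pattern before the command runs. A small sketch with hypothetical sample files:

```shell
# Work in a throwaway directory with a few sample files
workdir=$(mktemp -d)
cd "$workdir"
touch a1.txt a2.txt b1.txt notes.xml

echo *.txt         # a1.txt a2.txt b1.txt
echo a*            # a1.txt a2.txt
echo [ab]1.txt     # a1.txt b1.txt
echo {a1,b1}.txt   # brace expansion in bash/zsh: a1.txt b1.txt
```

Unlike `*` and `[...]`, brace expansion generates the names whether or not the files exist.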

# Bulk operations: the for loop

E.g. backing up files:

```
for f in *.xml; do cp "$f" "$f.bkp"; done
```
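
Before running a bulk operation for real, it can help to print the commands first. The sketch below writes the same backup loop over several lines and starts with a dry run using `echo`; the sample files `a.xml` and `b.xml` are made up for the example.

```shell
# Work in a throwaway directory with two sample files
workdir=$(mktemp -d)
cd "$workdir"
touch a.xml b.xml

# Dry run: print each command instead of executing it
for f in *.xml; do
    echo cp "$f" "$f.bkp"
done

# Real run, once the printed commands look right
for f in *.xml; do
    cp "$f" "$f.bkp"
done
ls
```

Quoting `"$f"` keeps the loop working even if a file name happens to contain a space.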

# Shell scripts

* You can save commands in a shell script and re-use them

Here, `$1` is the first command-line argument, in this case the desired extension:
```
# contents of the file backup.sh
for f in *."$1"; do cp "$f" "$f.bkp"; done
```

Run the script with `bash backup.sh xml`.
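
For a self-contained way to try this, the sketch below creates `backup.sh` from the command line and runs it on hypothetical sample files. Writing the script with `cat > backup.sh << 'EOF'` (a "here document") is just a convenience for this example; creating the file in `nano` or `vim` works equally well.

```shell
# Work in a throwaway directory with a few sample files
workdir=$(mktemp -d)
cd "$workdir"
touch a.xml b.xml notes.txt

# Write the script to disk; the quoted 'EOF' keeps $1 and $f literal
cat > backup.sh << 'EOF'
# Usage: bash backup.sh EXTENSION
for f in *."$1"; do
    cp "$f" "$f.bkp"
done
EOF

bash backup.sh xml   # backs up a.xml and b.xml, leaves notes.txt alone
ls
```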

# Finding things

* `grep 'something' file.txt`: print the lines of a file that contain `something`
    * `-i` (case insensitive) `-v` (invert match) `-n` (line numbers)
* `find dir`: find files in directory
    * `-type d` (directories only) `-type f` (files only) `-name '*.txt'` (match a wildcard pattern, not a regex)
    * example:
    ```
    wc -l $(find . -name '*.txt')
    ```
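
The options above can be tried on a tiny example. This is only a sketch; the `docs` directory and the contents of `fruit.txt` are invented sample data.

```shell
# Work in a throwaway directory with some sample data
workdir=$(mktemp -d)
cd "$workdir"
mkdir docs
printf 'Apple\nbanana\napple pie\n' > docs/fruit.txt
touch docs/empty.xml

grep -n 'apple' docs/fruit.txt      # with line numbers: 3:apple pie
grep -i 'apple' docs/fruit.txt      # ignore case: Apple, apple pie
grep -v -i 'apple' docs/fruit.txt   # lines that do NOT match: banana
find . -type d                      # directories only
find . -type f -name '*.txt'        # ./docs/fruit.txt
```

The pattern given to `-name` is quoted so that `find` receives it as written, instead of the shell expanding it first.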




### HW for next week: go through https://swcarpentry.github.io/shell-novice/, submit file with name and email on NYU Classes

### For labs tomorrow: install Anaconda, IPython/Jupyter, NumPy/SciPy/pandas



# Questions?