# Downloading Data on the Command Line

## `curl`

- short for `c`lient for `url`s
- Unix command-line tool
- basic syntax: `curl [option flags] [URL]` (the URL is required)

Example usage if a file is stored at https://websitename.com/datafilename.txt:

In [None]:
!curl https://websitename.com/datafilename.txt

### `curl` flags

- `-O`: saves the file under its original (remote) filename
- `-o`: saves the file under a new name (the new filename must follow the flag)
- `-L`: follows the redirect if the server responds with a 3xx status code
- `-C`: resumes a previous file transfer that was cut off before completion (`-C -` lets `curl` work out where to resume)

### Wildcards in `curl`

Servers often host files with similar filenames, like:
- https://websitename.com/datafilename001.txt
- https://websitename.com/datafilename002.txt
- https://websitename.com/datafilename100.txt

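`curl` does not expand `*` in URLs. Its URL globbing supports sets in braces and ranges in brackets, so to fetch a few named files we list them in a set (quoting the URL keeps the shell from interpreting the braces and brackets):

In [None]:
!curl -O "https://websitename.com/datafilename{001,002,100}.txt"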

or get just file 1 through 100

In [None]:
!curl -O https://websitename.com/datafilename[001-100].txt

or get just every 10th file

In [None]:
!curl -O https://websitename.com/datafilename[001-100:10].txt
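
To see which URLs a bracket range stands for, here is a minimal Python sketch of the expansion. It only illustrates the idea for a single numeric range; the helper `expand_numeric_range` is a made-up name, and the zero-padding rule (width taken from a start value with leading zeros) is my reading of how `curl` behaves, not its actual implementation.

```python
import re

def expand_numeric_range(url):
    """Expand one curl-style [start-end] or [start-end:step] numeric range."""
    match = re.search(r"\[(\d+)-(\d+)(?::(\d+))?\]", url)
    if match is None:
        return [url]  # nothing to expand
    start, end, step = match.group(1), match.group(2), match.group(3)
    # Assumption: a start value with leading zeros sets the padding width (001 -> 3 digits)
    width = len(start) if start.startswith("0") else 0
    numbers = range(int(start), int(end) + 1, int(step or 1))
    prefix, suffix = url[:match.start()], url[match.end():]
    return [prefix + str(n).zfill(width) + suffix for n in numbers]

urls = expand_numeric_range("https://websitename.com/datafilename[001-100:10].txt")
print(len(urls))   # 10
print(urls[0])     # https://websitename.com/datafilename001.txt
print(urls[-1])    # https://websitename.com/datafilename091.txt
```

With a step of 10 starting at 001, the last file requested is 091, not 100.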

### Sample use

In [3]:
# Download and rename the file in the same step
# -L makes curl follow the redirect if the server answers with one
!curl -o ../datasets/Spotify201812.zip -L https://assets.datacamp.com/production/repositories/4180/datasets/eb1d6a36fa3039e4e00064797e1a1600d267b135/201812SpotifyData.zip

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1944k  100 1944k    0     0  1767k      0  0:00:01  0:00:01 --:--:-- 1768k


In [5]:
# Download all 100 data files
# saves them with the original filename like datafile001
!curl -O https://s3.amazonaws.com/assets.datacamp.com/production/repositories/4180/datasets/files/datafile[001-100].txt

# List all downloaded files in the current directory
!ls datafile*.txt


[1/100]: https://s3.amazonaws.com/assets.datacamp.com/production/repositories/4180/datasets/files/datafile001.txt --> datafile001.txt
--_curl_--https://s3.amazonaws.com/assets.datacamp.com/production/repositories/4180/datasets/files/datafile001.txt
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0

[2/100]: https://s3.amazonaws.com/assets.datacamp.com/production/repositories/4180/datasets/files/datafile002.txt --> datafile002.txt
--_curl_--https://s3.amazonaws.com/assets.datacamp.com/production/repositories/4180/datasets/files/datafile002.txt
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0

[3/100]: https://s3.amazonaws.com/assets.datacamp.com/production/repositories/4180/datasets/files/datafile003.txt --> datafile003.txt
--_curl_--https://s3.amazonaws.com/assets.datacam

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0

[26/100]: https://s3.amazonaws.com/assets.datacamp.com/production/repositories/4180/datasets/files/datafile026.txt --> datafile026.txt
--_curl_--https://s3.amazonaws.com/assets.datacamp.com/production/repositories/4180/datasets/files/datafile026.txt
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0

[27/100]: https://s3.amazonaws.com/assets.datacamp.com/production/repositories/4180/datasets/files/datafile027.txt --> datafile027.txt
--_curl_--https://s3.amazonaws.com/assets.datacamp.com/production/repositories/4180/datasets/files/datafile027.txt
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0

[28/100]: https://s3.amazonaws.com/assets.datacamp.com/production/repositories/4180/datasets/files/datafile028.txt --> datafile028.txt
--_curl_--https://s3.amazonaws.com/assets.datacamp.com/production/repositories/4180/datasets/files/datafile028.txt
  0     0 

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0

[51/100]: https://s3.amazonaws.com/assets.datacamp.com/production/repositories/4180/datasets/files/datafile051.txt --> datafile051.txt
--_curl_--https://s3.amazonaws.com/assets.datacamp.com/production/repositories/4180/datasets/files/datafile051.txt
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0

[52/100]: https://s3.amazonaws.com/assets.datacamp.com/production/repositories/4180/datasets/files/datafile052.txt --> datafile052.txt
--_curl_--https://s3.amazonaws.com/assets.datacamp.com/production/repositories/4180/datasets/files/datafile052.txt
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0

[53/100]: https://s3.amazonaws.com/assets.datacamp.com/production/repositories/4180/datasets/files/datafile053.txt --> datafile053.txt
--_curl_--https://s3.amazonaws.com/assets.datacamp.com/production/repositories/4180/datasets/files/datafile053.txt
  0     0 

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0

[76/100]: https://s3.amazonaws.com/assets.datacamp.com/production/repositories/4180/datasets/files/datafile076.txt --> datafile076.txt
--_curl_--https://s3.amazonaws.com/assets.datacamp.com/production/repositories/4180/datasets/files/datafile076.txt
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0

[77/100]: https://s3.amazonaws.com/assets.datacamp.com/production/repositories/4180/datasets/files/datafile077.txt --> datafile077.txt
--_curl_--https://s3.amazonaws.com/assets.datacamp.com/production/repositories/4180/datasets/files/datafile077.txt
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0

[78/100]: https://s3.amazonaws.com/assets.datacamp.com/production/repositories/4180/datasets/files/datafile078.txt --> datafile078.txt
--_curl_--https://s3.amazonaws.com/assets.datacamp.com/production/repositories/4180/datasets/files/datafile078.txt
  0     0 

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
datafile001.txt datafile021.txt datafile041.txt datafile061.txt datafile081.txt
datafile002.txt datafile022.txt datafile042.txt datafile062.txt datafile082.txt
datafile003.txt datafile023.txt datafile043.txt datafile063.txt datafile083.txt
datafile004.txt datafile024.txt datafile044.txt datafile064.txt datafile084.txt
datafile005.txt datafile025.txt datafile045.txt datafile065.txt datafile085.txt
datafile006.txt datafile026.txt datafile046.txt datafile066.txt datafile086.txt
datafile007.txt datafile027.txt datafile047.txt datafile067.txt datafile087.txt
datafile008.txt datafile028.txt datafile048.txt datafile068.txt datafile088.txt
datafile009.txt datafile029.txt datafile049.txt datafile069.txt datafile089.txt
datafile010.txt datafile030.txt datafile050.txt datafile070.txt datafile090.txt
datafile011.txt datafile031.txt datafile051.txt datafile071.txt datafile091.txt
datafile012.txt datafile032.txt datafile0

## `wget`

- Another command line tool to get data from the internet
- Comes from "world wide web get"

Check if installed:

In [7]:
!which wget

/usr/local/bin/wget


- Basic syntax is like `curl`: `wget [option flags] [URL]`
- Some unique `wget` flags are:
    - `-b`: go to background after startup
    - `-q`: turn off the `wget` output
    - `-c`: resume a broken download
- Flags can be combined like `-bqc`

### Log file

When it runs in the background (`-b`), `wget` writes its output to a log file which we can inspect to check everything went ok. The file is called `wget-log`.

### Passing in a list of urls from a file

`wget` can accept a file with a list of urls to download data from. To signal we are feeding in urls from a file we use the `-i` flag.

> all other flags must appear before the `-i` flag (the filename must appear immediately after the `-i`) 
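
The list file is plain text with one url per line. A small Python sketch of building one is below; the three urls reuse the placeholder filenames from earlier in these notes, and the file is written to a temporary directory only so the example does not touch the working directory.

```python
import tempfile
from pathlib import Path

# Placeholder urls, following the datafilenameNNN.txt pattern used above
urls = [f"https://websitename.com/datafilename{n:03d}.txt" for n in (1, 2, 100)]

# One url per line is the layout `wget -i` reads
list_path = Path(tempfile.mkdtemp()) / "url_list.txt"
list_path.write_text("\n".join(urls) + "\n")

print(list_path.read_text(), end="")
```

The resulting file can then be passed as `wget -i url_list.txt`.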

### Ensure `wget` does not consume full bandwidth with a download

We can set an upper download bandwidth limit with `--limit-rate` (a plain number is read as bytes per second; add the `k` suffix for kilobytes or `m` for megabytes)

In [None]:
!wget --limit-rate=200k -i url_list.txt
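
To get a feel for what a cap means in practice, the sketch below converts a `--limit-rate` value to bytes per second and estimates a download time. It assumes `k` and `m` mean multiples of 1024, which is my understanding of `wget`; the function name is made up, and the 1944 KB size is the Spotify zip reported by `curl` earlier.

```python
def rate_to_bytes_per_second(rate):
    """Convert a --limit-rate style value such as '200k' to bytes per second."""
    units = {"k": 1024, "m": 1024 ** 2}  # assumption: binary multiples
    suffix = rate[-1].lower()
    if suffix in units:
        return float(rate[:-1]) * units[suffix]
    return float(rate)  # no suffix: plain bytes per second

limit = rate_to_bytes_per_second("200k")
file_size_bytes = 1944 * 1024  # the 1944k download shown earlier

print(limit)                              # 204800.0
print(round(file_size_bytes / limit, 2))  # 9.72 (seconds, at best)
```

So a file that took about a second unthrottled would need roughly ten seconds under a 200k cap.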

### Being considerate of the file host server

We can also pause between downloads so we do not tax the file server too much. That is accomplished with `--wait` (the value is in seconds by default)

In [None]:
!wget --wait=2 -i url_list.txt

## `curl` vs `wget`

- `curl` advantages:
    - can download and upload files using 20+ protocols
    - easier to install across all operating systems
- `wget` advantages:
    - handles multiple file downloads gracefully
    - can handle various file formats for download (e.g. file directory, HTML page)

## Example of `curl` and `wget`

In [None]:
# Use curl, download and rename a single file from URL
!curl -o Spotify201812.zip -L https://assets.datacamp.com/production/repositories/4180/datasets/eb1d6a36fa3039e4e00064797e1a1600d267b135/201812SpotifyData.zip

# Unzip, delete, then re-name to Spotify201812.csv
!unzip Spotify201812.zip && rm Spotify201812.zip
!mv 201812SpotifyData.csv Spotify201812.csv

# View url_list.txt to verify content
!cat url_list.txt

# Use Wget, limit the download rate to 2500KB/s, download all files in url_list.txt
!wget --limit-rate=2500k -i url_list.txt

# Take a look at all files downloaded
!ls

# Data Cleaning and Munging on the Command Line

## `csvkit`

`csvkit` is a suite of command-line tools to process and clean `csv` files.

In [8]:
# written in python so we install it with pip
# the docs are online (not accessible with man)
!pip install csvkit

Collecting csvkit
  Downloading csvkit-1.0.4.tar.gz (3.8 MB)
Collecting agate>=1.6.1
  Downloading agate-1.6.1-py2.py3-none-any.whl (98 kB)
Collecting agate-excel>=0.2.2
  Downloading agate-excel-0.2.3.tar.gz (153 kB)
Collecting agate-dbf>=0.2.0
  Downloading agate_dbf-0.2.1-py2.py3-none-any.whl (3.7 kB)
Collecting agate-sql>=0.5.3
  Downloading agate-sql-0.5.4.tar.gz (6.3 kB)
Collecting Babel>=2.0
  Downloading Babel-2.8.0-py2.py3-none-any.whl (8.6 MB)
Collecting pytimeparse>=1.1.5
  Downloading pytimeparse-1.1.8-py2.py3-none-any.whl (10.0 kB)
Collecting python-slugify>=1.2.1
  Downloading python-slugify-4.0.0.tar.gz (8.8 kB)
Collecting leather>=0.3.2
  Downloading leather-0.3.3-py

### Common commands

- `in2csv` converts an `xlsx` file to `csv`
    - `in2csv SpotifyData.xlsx > SpotifyData.csv`
    - specifying which sheets to convert
        - `in2csv SpotifyData.xlsx --sheet "Worksheet1_Popularity" > Spotify_Popularity.csv`
    - help with `in2csv -h`

> `in2csv` does not print logs to the console so always double check with `ls`

- `csvlook` prints csv files in the command line in a formatted way
    - `csvlook Spotify_Popularity.csv`
    - help with `csvlook -h`
- `csvstat` prints summary statistics per column, similar to `df.describe()` in pandas
    - `csvstat Spotify_Popularity.csv`

### Filtering data with `csvkit`

A `csv` can be filtered with:
- `csvcut` for columns
- `csvgrep` for rows

#### `csvcut`

In [11]:
# csvcut can filter by col name or col position
# using -n shows us the col names
!csvcut -n ../datasets/Spotify_MusicAttributes.csv

  1: track_id
  2: danceability
  3: duration_ms
  4: instrumentalness
  5: loudness
  6: tempo
  7: time_signature


In [22]:
# we can use names to be extra explicit
!csvcut --names ../datasets/Spotify_MusicAttributes.csv

  1: track_id
  2: danceability
  3: duration_ms
  4: instrumentalness
  5: loudness
  6: tempo
  7: time_signature


In [12]:
# Now we can filter more easily (by position)
!csvcut -c 1 ../datasets/Spotify_MusicAttributes.csv

track_id
118GQ70Sp6pMqn6w1oKuki
6S7cr72a7a8RVAXzDCRj6m
7h2qWpMJzIVtiP30E8VDW4
3KVQFxJ5CWOcbxdpPYdi4o
0JjNrI1xmsTfhaiU1R6OVc
3HjTcZt29JUHg5m60QhlMw
42LWRdkWxM9aWmDImWvH6C
32dMH9MvlTJaABrPHY52Yb
5RCPsfzmEpTXMCTNk7wEfQ
0y0mwXrdEzjUK5Nq8GDPnY
3RSMqu36JZnmMkrnNmnqyd
1o0fkWCltFHVeFIRHqvR5b
2iGShSeV6WcDbez5SLJ2bJ
2rNTo0tGUMW6rn0uHzV5er
5Egkx8edirN0pR2R58C2ME
67r3lnzstENsRYlZWq6DYP
4X8W9SSu9D5MujoxwIwqw6
4lncSzeN8WOH2iHEO593iZ
1L67mcddFQ65MfA3wO3MHV
21DU83QG4jB4mQKh76X32h
08nyEVO684j7pcTAhEY2zJ
4LMVmlX8WXPu8OyPwzkNpR
7JYCpIzpoidDOnnmxmHwtj
0mmFibEg5NuULMwTVN2tRU


In [13]:
# Now we can filter more easily (by name)
!csvcut -c "track_id" ../datasets/Spotify_MusicAttributes.csv

track_id
118GQ70Sp6pMqn6w1oKuki
6S7cr72a7a8RVAXzDCRj6m
7h2qWpMJzIVtiP30E8VDW4
3KVQFxJ5CWOcbxdpPYdi4o
0JjNrI1xmsTfhaiU1R6OVc
3HjTcZt29JUHg5m60QhlMw
42LWRdkWxM9aWmDImWvH6C
32dMH9MvlTJaABrPHY52Yb
5RCPsfzmEpTXMCTNk7wEfQ
0y0mwXrdEzjUK5Nq8GDPnY
3RSMqu36JZnmMkrnNmnqyd
1o0fkWCltFHVeFIRHqvR5b
2iGShSeV6WcDbez5SLJ2bJ
2rNTo0tGUMW6rn0uHzV5er
5Egkx8edirN0pR2R58C2ME
67r3lnzstENsRYlZWq6DYP
4X8W9SSu9D5MujoxwIwqw6
4lncSzeN8WOH2iHEO593iZ
1L67mcddFQ65MfA3wO3MHV
21DU83QG4jB4mQKh76X32h
08nyEVO684j7pcTAhEY2zJ
4LMVmlX8WXPu8OyPwzkNpR
7JYCpIzpoidDOnnmxmHwtj
0mmFibEg5NuULMwTVN2tRU


In [14]:
# Filtering more than one col by position
!csvcut -c 1,2 ../datasets/Spotify_MusicAttributes.csv

track_id,danceability
118GQ70Sp6pMqn6w1oKuki,0.787
6S7cr72a7a8RVAXzDCRj6m,0.777
7h2qWpMJzIVtiP30E8VDW4,0.795999999999999
3KVQFxJ5CWOcbxdpPYdi4o,0.815
0JjNrI1xmsTfhaiU1R6OVc,0.799
3HjTcZt29JUHg5m60QhlMw,0.812
42LWRdkWxM9aWmDImWvH6C,0.810999999999999
32dMH9MvlTJaABrPHY52Yb,0.746
5RCPsfzmEpTXMCTNk7wEfQ,0.813
0y0mwXrdEzjUK5Nq8GDPnY,0.812
3RSMqu36JZnmMkrnNmnqyd,0.814
1o0fkWCltFHVeFIRHqvR5b,0.813
2iGShSeV6WcDbez5SLJ2bJ,0.81
2rNTo0tGUMW6rn0uHzV5er,0.805999999999999
5Egkx8edirN0pR2R58C2ME,0.812
67r3lnzstENsRYlZWq6DYP,0.802
4X8W9SSu9D5MujoxwIwqw6,0.822
4lncSzeN8WOH2iHEO593iZ,0.809
1L67mcddFQ65MfA3wO3MHV,0.805999999999999
21DU83QG4jB4mQKh76X32h,0.812
08nyEVO684j7pcTAhEY2zJ,0.81
4LMVmlX8WXPu8OyPwzkNpR,0.813
7JYCpIzpoidDOnnmxmHwtj,0.759
0mmFibEg5NuULMwTVN2tRU,0.81


In [16]:
# Filtering more than one col by name
# notice no space between the col names
!csvcut -c "track_id","danceability" ../datasets/Spotify_MusicAttributes.csv

track_id,danceability
118GQ70Sp6pMqn6w1oKuki,0.787
6S7cr72a7a8RVAXzDCRj6m,0.777
7h2qWpMJzIVtiP30E8VDW4,0.795999999999999
3KVQFxJ5CWOcbxdpPYdi4o,0.815
0JjNrI1xmsTfhaiU1R6OVc,0.799
3HjTcZt29JUHg5m60QhlMw,0.812
42LWRdkWxM9aWmDImWvH6C,0.810999999999999
32dMH9MvlTJaABrPHY52Yb,0.746
5RCPsfzmEpTXMCTNk7wEfQ,0.813
0y0mwXrdEzjUK5Nq8GDPnY,0.812
3RSMqu36JZnmMkrnNmnqyd,0.814
1o0fkWCltFHVeFIRHqvR5b,0.813
2iGShSeV6WcDbez5SLJ2bJ,0.81
2rNTo0tGUMW6rn0uHzV5er,0.805999999999999
5Egkx8edirN0pR2R58C2ME,0.812
67r3lnzstENsRYlZWq6DYP,0.802
4X8W9SSu9D5MujoxwIwqw6,0.822
4lncSzeN8WOH2iHEO593iZ,0.809
1L67mcddFQ65MfA3wO3MHV,0.805999999999999
21DU83QG4jB4mQKh76X32h,0.812
08nyEVO684j7pcTAhEY2zJ,0.81
4LMVmlX8WXPu8OyPwzkNpR,0.813
7JYCpIzpoidDOnnmxmHwtj,0.759
0mmFibEg5NuULMwTVN2tRU,0.81


#### `csvgrep`

- filters rows by matching the values in a column against a string or a regex pattern
- must be paired with one of these options
    - `-m`: followed by the string to search for
    - `-r`: followed by a regex pattern
    - `-f`: followed by the path to a file (a row matches if the cell equals any line in the file)
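
The two most common modes can be imitated in a few lines of Python. This is only a sketch of the idea, not `csvkit`'s code: `grep_rows` is a made-up helper, the three rows are copied from `Spotify_Popularity.csv` as shown in these notes, and treating `-m` as "cell contains the string" and `-r` as a regex search is my understanding of the tool.

```python
import csv
import io
import re

SAMPLE = (
    "track_id,popularity\n"
    "118GQ70Sp6pMqn6w1oKuki,7.0\n"
    "4X8W9SSu9D5MujoxwIwqw6,6.0\n"
    "4lncSzeN8WOH2iHEO593iZ,6.0\n"
)

def grep_rows(csv_text, column, predicate):
    """Return the rows whose value in `column` satisfies `predicate`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if predicate(row[column])]

# Like -m: keep rows whose cell contains the given string
by_string = grep_rows(SAMPLE, "track_id", lambda cell: "4X8W9SSu9D5MujoxwIwqw6" in cell)

# Like -r: keep rows whose cell matches a regex (here: ids starting with 4)
by_regex = grep_rows(SAMPLE, "track_id", lambda cell: re.search(r"^4", cell) is not None)

print([row["track_id"] for row in by_string])  # ['4X8W9SSu9D5MujoxwIwqw6']
print([row["track_id"] for row in by_regex])   # ['4X8W9SSu9D5MujoxwIwqw6', '4lncSzeN8WOH2iHEO593iZ']
```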

In [18]:
# get just the rows for this track id, selecting the column by name
!csvgrep -c "track_id" -m 4X8W9SSu9D5MujoxwIwqw6 ../datasets/Spotify_Popularity.csv 

track_id,popularity
4X8W9SSu9D5MujoxwIwqw6,6.0


In [19]:
# get just the rows for this track id, selecting the column by position
!csvgrep -c 1 -m 4X8W9SSu9D5MujoxwIwqw6 ../datasets/Spotify_Popularity.csv 

track_id,popularity
4X8W9SSu9D5MujoxwIwqw6,6.0


> Column positions are 1-based unlike in Python!

> `-c` stands for `--columns` and `-m` for `--match`
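
The off-by-one is easy to trip over when moving between `csvkit` and Python, so here is a minimal sketch of position-based column selection with Python's `csv` module. `cut_columns` is a hypothetical helper that accepts 1-based positions the way `csvcut -c` does; the two rows come from `Spotify_MusicAttributes.csv` as printed above.

```python
import csv
import io

SAMPLE = (
    "track_id,danceability\n"
    "118GQ70Sp6pMqn6w1oKuki,0.787\n"
    "6S7cr72a7a8RVAXzDCRj6m,0.777\n"
)

def cut_columns(csv_text, positions):
    """Keep the columns at the given 1-based positions, like `csvcut -c 1,2`."""
    rows = csv.reader(io.StringIO(csv_text))
    # Python lists are 0-based, so position p lives at index p - 1
    return [[row[p - 1] for p in positions] for row in rows]

print(cut_columns(SAMPLE, [1]))  # [['track_id'], ['118GQ70Sp6pMqn6w1oKuki'], ['6S7cr72a7a8RVAXzDCRj6m']]
print(cut_columns(SAMPLE, [2]))  # [['danceability'], ['0.787'], ['0.777']]
```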

## Stacking data and chaining commands with `csvkit`

### `csvstack` to combine multiple csv files

- Similar to `pd.concat`: appends files
- Help with `csvstack -h`
- Sample usage (assuming same file structure) 
    - `csvstack Spotify_Rank6.csv Spotify_Rank7.csv > Spotify_AllRanks.csv`
- Useful to keep a record of which row came from which file
    - `csvstack -g "Rank6","Rank7" Spotify_Rank6.csv Spotify_Rank7.csv > Spotify_AllRanks.csv`
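
What the grouping flag adds can be sketched in plain Python: the header is written once, and every row is prefixed with the label of the file it came from. The `stack` helper and the one-row file contents below are illustrative only (the real contents of the rank files are not shown in these notes), and the sketch assumes all inputs share the same header.

```python
import csv
import io

def stack(groups, texts, group_column="group"):
    """Append CSV texts with identical headers, tagging each row with its group label."""
    stacked = []
    for index, (group, text) in enumerate(zip(groups, texts)):
        rows = list(csv.reader(io.StringIO(text)))
        header, body = rows[0], rows[1:]
        if index == 0:
            stacked.append([group_column] + header)  # header appears only once
        stacked.extend([group] + row for row in body)
    return stacked

# Made-up one-row stand-ins for Spotify_Rank6.csv and Spotify_Rank7.csv
rank6 = "track_id,popularity\n3RSMqu36JZnmMkrnNmnqyd,6.0\n"
rank7 = "track_id,popularity\n118GQ70Sp6pMqn6w1oKuki,7.0\n"

for row in stack(["Rank6", "Rank7"], [rank6, rank7]):
    print(",".join(row))
# group,track_id,popularity
# Rank6,3RSMqu36JZnmMkrnNmnqyd,6.0
# Rank7,118GQ70Sp6pMqn6w1oKuki,7.0
```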

In [23]:
# merging the files
!csvstack ../datasets/Spotify_Popularity.csv ../datasets/Spotify_Popularity_1.csv > SpotifyCopy.csv

In [24]:
!cat SpotifyCopy.csv

track_id,popularity
118GQ70Sp6pMqn6w1oKuki,7.0
6S7cr72a7a8RVAXzDCRj6m,7.0
7h2qWpMJzIVtiP30E8VDW4,7.0
3KVQFxJ5CWOcbxdpPYdi4o,7.0
0JjNrI1xmsTfhaiU1R6OVc,7.0
3HjTcZt29JUHg5m60QhlMw,7.0
42LWRdkWxM9aWmDImWvH6C,7.0
32dMH9MvlTJaABrPHY52Yb,7.0
5RCPsfzmEpTXMCTNk7wEfQ,7.0
0y0mwXrdEzjUK5Nq8GDPnY,7.0
3RSMqu36JZnmMkrnNmnqyd,6.0
1o0fkWCltFHVeFIRHqvR5b,6.0
2iGShSeV6WcDbez5SLJ2bJ,6.0
2rNTo0tGUMW6rn0uHzV5er,6.0
5Egkx8edirN0pR2R58C2ME,6.0
67r3lnzstENsRYlZWq6DYP,6.0
4X8W9SSu9D5MujoxwIwqw6,6.0
4lncSzeN8WOH2iHEO593iZ,6.0
1L67mcddFQ65MfA3wO3MHV,6.0
21DU83QG4jB4mQKh76X32h,6.0
08nyEVO684j7pcTAhEY2zJ,6.0
4LMVmlX8WXPu8OyPwzkNpR,6.0
7JYCpIzpoidDOnnmxmHwtj,6.0
0mmFibEg5NuULMwTVN2tRU,6.0
118GQ70Sp6pMqn6w1oKuki,7.0
6S7cr72a7a8RVAXzDCRj6m,7.0
7h2qWpMJzIVtiP30E8VDW4,7.0
3KVQFxJ5CWOcbxdpPYdi4o,7.0
0JjNrI1xmsTfhaiU1R6OVc,7.0
3HjTcZt29JUHg5m60QhlMw,7.0
42LWRdkWxM9aWmDImWvH6C,7.0
32dMH9MvlTJaABrPHY52Yb,7.0
5RCPsfzmEpTXMCTNk7wEfQ,7.0
0y0mwXrdEzjUK5Nq8GDPnY,7.0
3RSMqu36JZnmMkrnNmnqyd,6.0

In [25]:
# adding a flag to know which rows came from where
!csvstack -g "File1","File2" ../datasets/Spotify_Popularity.csv ../datasets/Spotify_Popularity_1.csv > SpotifyCopy.csv

In [26]:
!cat SpotifyCopy.csv

group,track_id,popularity
File1,118GQ70Sp6pMqn6w1oKuki,7.0
File1,6S7cr72a7a8RVAXzDCRj6m,7.0
File1,7h2qWpMJzIVtiP30E8VDW4,7.0
File1,3KVQFxJ5CWOcbxdpPYdi4o,7.0
File1,0JjNrI1xmsTfhaiU1R6OVc,7.0
File1,3HjTcZt29JUHg5m60QhlMw,7.0
File1,42LWRdkWxM9aWmDImWvH6C,7.0
File1,32dMH9MvlTJaABrPHY52Yb,7.0
File1,5RCPsfzmEpTXMCTNk7wEfQ,7.0
File1,0y0mwXrdEzjUK5Nq8GDPnY,7.0
File1,3RSMqu36JZnmMkrnNmnqyd,6.0
File1,1o0fkWCltFHVeFIRHqvR5b,6.0
File1,2iGShSeV6WcDbez5SLJ2bJ,6.0
File1,2rNTo0tGUMW6rn0uHzV5er,6.0
File1,5Egkx8edirN0pR2R58C2ME,6.0
File1,67r3lnzstENsRYlZWq6DYP,6.0
File1,4X8W9SSu9D5MujoxwIwqw6,6.0
File1,4lncSzeN8WOH2iHEO593iZ,6.0
File1,1L67mcddFQ65MfA3wO3MHV,6.0
File1,21DU83QG4jB4mQKh76X32h,6.0
File1,08nyEVO684j7pcTAhEY2zJ,6.0
File1,4LMVmlX8WXPu8OyPwzkNpR,6.0
File1,7JYCpIzpoidDOnnmxmHwtj,6.0
File1,0mmFibEg5NuULMwTVN2tRU,6.0
File2,118GQ70Sp6pMqn6w1oKuki,7.0
File2,6S7cr72a7a8RVAXzDCRj6m,7.0
File2,7h2qWpMJzIVtiP30E8VDW4,7.0
File2,3KVQFxJ5CWOcbxdpPYdi4o,7.0
File2,0JjNrI1xmsTfhai

In [27]:
# adding a flag to know which rows came from where and changing col 
# name from default group to something else
!csvstack -g "File1","File2" -n "source" ../datasets/Spotify_Popularity.csv ../datasets/Spotify_Popularity_1.csv > SpotifyCopy.csv

In [28]:
!cat SpotifyCopy.csv

source,track_id,popularity
File1,118GQ70Sp6pMqn6w1oKuki,7.0
File1,6S7cr72a7a8RVAXzDCRj6m,7.0
File1,7h2qWpMJzIVtiP30E8VDW4,7.0
File1,3KVQFxJ5CWOcbxdpPYdi4o,7.0
File1,0JjNrI1xmsTfhaiU1R6OVc,7.0
File1,3HjTcZt29JUHg5m60QhlMw,7.0
File1,42LWRdkWxM9aWmDImWvH6C,7.0
File1,32dMH9MvlTJaABrPHY52Yb,7.0
File1,5RCPsfzmEpTXMCTNk7wEfQ,7.0
File1,0y0mwXrdEzjUK5Nq8GDPnY,7.0
File1,3RSMqu36JZnmMkrnNmnqyd,6.0
File1,1o0fkWCltFHVeFIRHqvR5b,6.0
File1,2iGShSeV6WcDbez5SLJ2bJ,6.0
File1,2rNTo0tGUMW6rn0uHzV5er,6.0
File1,5Egkx8edirN0pR2R58C2ME,6.0
File1,67r3lnzstENsRYlZWq6DYP,6.0
File1,4X8W9SSu9D5MujoxwIwqw6,6.0
File1,4lncSzeN8WOH2iHEO593iZ,6.0
File1,1L67mcddFQ65MfA3wO3MHV,6.0
File1,21DU83QG4jB4mQKh76X32h,6.0
File1,08nyEVO684j7pcTAhEY2zJ,6.0
File1,4LMVmlX8WXPu8OyPwzkNpR,6.0
File1,7JYCpIzpoidDOnnmxmHwtj,6.0
File1,0mmFibEg5NuULMwTVN2tRU,6.0
File2,118GQ70Sp6pMqn6w1oKuki,7.0
File2,6S7cr72a7a8RVAXzDCRj6m,7.0
File2,7h2qWpMJzIVtiP30E8VDW4,7.0
File2,3KVQFxJ5CWOcbxdpPYdi4o,7.0
File2,0JjNrI1xmsTfha

### Chaining command-line commands

- `;` links commands together and runs them sequentially, whether or not the earlier ones succeed

In [29]:
!csvlook SpotifyCopy.csv; csvstat SpotifyCopy.csv

| source | track_id               | popularity |
| ------ | ---------------------- | ---------- |
| File1  | 118GQ70Sp6pMqn6w1oKuki |          7 |
| File1  | 6S7cr72a7a8RVAXzDCRj6m |          7 |
| File1  | 7h2qWpMJzIVtiP30E8VDW4 |          7 |
| File1  | 3KVQFxJ5CWOcbxdpPYdi4o |          7 |
| File1  | 0JjNrI1xmsTfhaiU1R6OVc |          7 |
| File1  | 3HjTcZt29JUHg5m60QhlMw |          7 |
| File1  | 42LWRdkWxM9aWmDImWvH6C |          7 |
| File1  | 32dMH9MvlTJaABrPHY52Yb |          7 |
| File1  | 5RCPsfzmEpTXMCTNk7wEfQ |          7 |
| File1  | 0y0mwXrdEzjUK5Nq8GDPnY |          7 |
| File1  | 3RSMqu36JZnmMkrnNmnqyd |          6 |
| File1  | 1o0fkWCltFHVeFIRHqvR5b |          6 |
| File1  | 2iGShSeV6WcDbez5SLJ2bJ |          6 |
| File1  | 2rNTo0tGUMW6rn0uHzV5er |          6 |
| File1  | 5Egkx8edirN0pR2R58C2ME |          6 |
| File1  | 67r3lnzstENsRYlZWq6DYP |          6 |
| File1  | 4X8W9SSu9D5MujoxwIwqw6 |          6 |
| File1  | 4lncSzeN8WOH2iHEO593iZ |          6 |
| File1  | 1L67mcddF

- `&&` links commands together, but the second one runs only if the first one succeeds

In [30]:
!csvlook SpotifyCopy.csv && csvstat SpotifyCopy.csv

| source | track_id               | popularity |
| ------ | ---------------------- | ---------- |
| File1  | 118GQ70Sp6pMqn6w1oKuki |          7 |
| File1  | 6S7cr72a7a8RVAXzDCRj6m |          7 |
| File1  | 7h2qWpMJzIVtiP30E8VDW4 |          7 |
| File1  | 3KVQFxJ5CWOcbxdpPYdi4o |          7 |
| File1  | 0JjNrI1xmsTfhaiU1R6OVc |          7 |
| File1  | 3HjTcZt29JUHg5m60QhlMw |          7 |
| File1  | 42LWRdkWxM9aWmDImWvH6C |          7 |
| File1  | 32dMH9MvlTJaABrPHY52Yb |          7 |
| File1  | 5RCPsfzmEpTXMCTNk7wEfQ |          7 |
| File1  | 0y0mwXrdEzjUK5Nq8GDPnY |          7 |
| File1  | 3RSMqu36JZnmMkrnNmnqyd |          6 |
| File1  | 1o0fkWCltFHVeFIRHqvR5b |          6 |
| File1  | 2iGShSeV6WcDbez5SLJ2bJ |          6 |
| File1  | 2rNTo0tGUMW6rn0uHzV5er |          6 |
| File1  | 5Egkx8edirN0pR2R58C2ME |          6 |
| File1  | 67r3lnzstENsRYlZWq6DYP |          6 |
| File1  | 4X8W9SSu9D5MujoxwIwqw6 |          6 |
| File1  | 4lncSzeN8WOH2iHEO593iZ |          6 |
| File1  | 1L67mcddF

- `>` redirects the output of a command to a file
- `|` uses the output of the first command as input to the second command
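
The difference between `;` and `&&` only shows up when the first command fails. The sketch below runs a few one-liners through the system shell from Python to make that visible; it assumes a POSIX shell is available, and `run` is just a small helper defined here. `true` and `false` are standard commands that do nothing except succeed or fail.

```python
import subprocess

def run(command):
    """Run a command line through the shell and return what it printed."""
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return result.stdout.strip()

after_semicolon = run("false ; echo second ran")  # second command runs anyway
after_and_fail = run("false && echo second ran")  # first fails, second is skipped
after_and_ok = run("true && echo second ran")     # first succeeds, second runs
piped = run("printf 'b\\na\\n' | sort")           # output of printf feeds sort

print(repr(after_semicolon))  # 'second ran'
print(repr(after_and_fail))   # ''
print(repr(after_and_ok))     # 'second ran'
print(repr(piped))            # 'a\nb'
```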

In [32]:
# works but the format sucks, we can pipe it to csvlook
!csvcut -c "track_id","danceability" ../datasets/Spotify_MusicAttributes.csv

track_id,danceability
118GQ70Sp6pMqn6w1oKuki,0.787
6S7cr72a7a8RVAXzDCRj6m,0.777
7h2qWpMJzIVtiP30E8VDW4,0.795999999999999
3KVQFxJ5CWOcbxdpPYdi4o,0.815
0JjNrI1xmsTfhaiU1R6OVc,0.799
3HjTcZt29JUHg5m60QhlMw,0.812
42LWRdkWxM9aWmDImWvH6C,0.810999999999999
32dMH9MvlTJaABrPHY52Yb,0.746
5RCPsfzmEpTXMCTNk7wEfQ,0.813
0y0mwXrdEzjUK5Nq8GDPnY,0.812
3RSMqu36JZnmMkrnNmnqyd,0.814
1o0fkWCltFHVeFIRHqvR5b,0.813
2iGShSeV6WcDbez5SLJ2bJ,0.81
2rNTo0tGUMW6rn0uHzV5er,0.805999999999999
5Egkx8edirN0pR2R58C2ME,0.812
67r3lnzstENsRYlZWq6DYP,0.802
4X8W9SSu9D5MujoxwIwqw6,0.822
4lncSzeN8WOH2iHEO593iZ,0.809
1L67mcddFQ65MfA3wO3MHV,0.805999999999999
21DU83QG4jB4mQKh76X32h,0.812
08nyEVO684j7pcTAhEY2zJ,0.81
4LMVmlX8WXPu8OyPwzkNpR,0.813
7JYCpIzpoidDOnnmxmHwtj,0.759
0mmFibEg5NuULMwTVN2tRU,0.81


In [33]:
# better
!csvcut -c "track_id","danceability" ../datasets/Spotify_MusicAttributes.csv | csvlook

| track_id               | danceability |
| ---------------------- | ------------ |
| 118GQ70Sp6pMqn6w1oKuki |       0.787… |
| 6S7cr72a7a8RVAXzDCRj6m |       0.777… |
| 7h2qWpMJzIVtiP30E8VDW4 |       0.796… |
| 3KVQFxJ5CWOcbxdpPYdi4o |       0.815… |
| 0JjNrI1xmsTfhaiU1R6OVc |       0.799… |
| 3HjTcZt29JUHg5m60QhlMw |       0.812… |
| 42LWRdkWxM9aWmDImWvH6C |       0.811… |
| 32dMH9MvlTJaABrPHY52Yb |       0.746… |
| 5RCPsfzmEpTXMCTNk7wEfQ |       0.813… |
| 0y0mwXrdEzjUK5Nq8GDPnY |       0.812… |
| 3RSMqu36JZnmMkrnNmnqyd |       0.814… |
| 1o0fkWCltFHVeFIRHqvR5b |       0.813… |
| 2iGShSeV6WcDbez5SLJ2bJ |       0.810… |
| 2rNTo0tGUMW6rn0uHzV5er |       0.806… |
| 5Egkx8edirN0pR2R58C2ME |       0.812… |
| 67r3lnzstENsRYlZWq6DYP |       0.802… |
| 4X8W9SSu9D5MujoxwIwqw6 |       0.822… |
| 4lncSzeN8WOH2iHEO593iZ |       0.809… |
| 1L67mcddFQ65MfA3wO3MHV |       0.806… |
| 21DU83QG4jB4mQKh76X32h |       0.812… |
| 08nyEVO684j7pcTAhEY2zJ |       0.810… |
| 4LMVmlX8W

# Database Operations on the Command Line

`sql2csv` is a command in the `csvkit` library which allows us to access data on a variety of SQL databases. It executes a SQL query and prints the result as csv, which we can redirect into a file.

## Syntax

In [None]:
!sql2csv --db "sqlite:///SpotifyDatabase.db" \
         --query "SELECT * FROM Spotify_Popularity" \
         > Spotify_popularity.csv

- `--db` is followed by the database connection string
    - Examples:
        - `sqlite:///` (followed by the path to the database file, which usually ends with `.db`)
        - `postgresql:///` or `mysql:///`
    - It is a string, so it needs to be in quotation marks
- `--query` is followed by the actual query
    - It is a string, so it needs to be in quotation marks
    - **The query needs to be written on a single line with no breaks**

We then redirect the output.
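
Conceptually, the command runs a query and writes the header plus the result rows as csv. A rough Python equivalent using only the standard library is sketched below. It uses an in-memory SQLite database with two rows copied from the popularity data in these notes, since `SpotifyDatabase.db` itself is not available here; it is an illustration, not how `sql2csv` is implemented.

```python
import csv
import io
import sqlite3

# Stand-in database: in-memory, with a tiny Spotify_Popularity table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Spotify_Popularity (track_id TEXT, popularity REAL)")
conn.executemany(
    "INSERT INTO Spotify_Popularity VALUES (?, ?)",
    [("118GQ70Sp6pMqn6w1oKuki", 7.0), ("4X8W9SSu9D5MujoxwIwqw6", 6.0)],
)

# Run the query, then write column names and rows as csv
cursor = conn.execute("SELECT * FROM Spotify_Popularity")
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow([column[0] for column in cursor.description])
writer.writerows(cursor)

csv_text = buffer.getvalue()
print(csv_text, end="")
# track_id,popularity
# 118GQ70Sp6pMqn6w1oKuki,7.0
# 4X8W9SSu9D5MujoxwIwqw6,6.0
```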

## Useful flags

We can also add `--verbose` if we are getting an error and want to see the logs.

In [34]:
# the table is not available locally
! sql2csv --db "sqlite:///SpotifyDatabase.db" \
        --query "SELECT * FROM Spotify_Popularity" 

(sqlite3.OperationalError) no such table: Spotify_Popularity
[SQL: SELECT * FROM Spotify_Popularity]
(Background on this error at: http://sqlalche.me/e/e3q8)


## Manipulating data using SQL syntax

We can use SQL to inspect csv files with `csvkit`: under the hood it converts the csv file to a temporary SQL database. This is possible with `csvsql`.

> This is not suitable for large file processing! 
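
The "temporary database" idea can be sketched with the standard library: load the csv into an in-memory SQLite table named after the file, run the query, and throw the database away. `query_csv` is a hypothetical helper and every column is stored as untyped text here, which is simpler than what `csvsql` does. Loading the whole file this way is also why the approach does not scale to large files.

```python
import csv
import io
import sqlite3

def query_csv(csv_text, table, sql):
    """Load csv text into an in-memory SQLite table and run one query on it."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    conn = sqlite3.connect(":memory:")  # discarded when the connection closes
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', body)
    result = conn.execute(sql).fetchall()
    conn.close()
    return result

# First rows of SpotifyCopy.csv as shown above
sample = (
    "source,track_id,popularity\n"
    "File1,118GQ70Sp6pMqn6w1oKuki,7.0\n"
    "File1,6S7cr72a7a8RVAXzDCRj6m,7.0\n"
)

first_row = query_csv(sample, "SpotifyCopy", "SELECT * FROM SpotifyCopy LIMIT 1")
print(first_row)  # [('File1', '118GQ70Sp6pMqn6w1oKuki', '7.0')]
```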

In [35]:
# We're using a SQL query on a csv file
!csvsql --query "SELECT * FROM SpotifyCopy LIMIT 1" SpotifyCopy.csv

source,track_id,popularity
File1,118GQ70Sp6pMqn6w1oKuki,7.0


In [37]:
# The output looks better if we pipe it
!csvsql --query "SELECT * FROM SpotifyCopy LIMIT 1" SpotifyCopy.csv | \
    csvlook 

| source | track_id               | popularity |
| ------ | ---------------------- | ---------- |
| File1  | 118GQ70Sp6pMqn6w1oKuki |          7 |


### Joining Files

We can join files too using `csvsql`! We pass in both csv files, and each one becomes a table named after its file.

In [54]:
!csvsql --query "SELECT m.*, p.popularity FROM Spotify_MusicAttributes AS m INNER JOIN Spotify_Popularity AS p ON m.track_id = p.track_id" ../datasets/Spotify_MusicAttributes.csv ../datasets/Spotify_Popularity.csv \
    | csvlook 



### Saving queries as variables

Sometimes the query gets long, so it is more readable to save it as a shell variable and then pass it in.

In [58]:
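# List the files in the working directory
!ls

# Store the SQL query as a shell variable and use it on the same line:
# in Jupyter each `!` command runs in its own subshell, so a variable
# set on one line is gone by the next
# duration_ms is a column of Spotify_MusicAttributes.csv, so we query that file
!sqlquery="SELECT * FROM Spotify_MusicAttributes ORDER BY duration_ms LIMIT 1" && csvsql --query "$sqlquery" ../datasets/Spotify_MusicAttributes.csv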

6_data_processing_in_shell.ipynb datafile050.txt
SpotifyCopy.csv                  datafile051.txt
SpotifyDatabase.db               datafile052.txt
datafile001.txt                  datafile053.txt
datafile002.txt                  datafile054.txt
datafile003.txt                  datafile055.txt
datafile004.txt                  datafile056.txt
datafile005.txt                  datafile057.txt
datafile006.txt                  datafile058.txt
datafile007.txt                  datafile059.txt
datafile008.txt                  datafile060.txt
datafile009.txt                  datafile061.txt
datafile010.txt                  datafile062.txt
datafile011.txt                  datafile063.txt
datafile012.txt                  datafile064.txt
datafile013.txt                  datafile065.txt
datafile014.txt                  datafile066.txt
datafile015.txt                  datafile067.txt
datafile016.txt                  datafile068.txt
datafile017.txt                  datafile069.txt
datafile018.txt     

## Pushing data back to the database

Using `csvsql` we can:
- execute SQL statements directly on a database (`--db` with `--query`)
- create a table and insert data into it (`--db` with `--insert`)

Sample syntax:

In [59]:
# this creates a table, infers its schema from the csv, and inserts the rows
!csvsql --db "sqlite:///SpotifyDatabase.db" --insert SpotifyCopy.csv

In [61]:
# --no-inference: do not infer column types (every column is treated as text)
# --no-constraints: generate the schema without length limits or null checks
# both are helpful for large tables because they speed up the process
!csvsql --no-inference --no-constraints --db "sqlite:///SpotifyDatabase_2.db" --insert SpotifyCopy.csv
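
To make "inference" concrete, here is a deliberately simplified sketch of guessing a SQL column type from string values. It is not `csvkit`'s algorithm, which as far as I know also recognises dates, booleans and more; `infer_sql_type` is a made-up name. Skipping this step, as `--no-inference` does, amounts to always answering `TEXT`.

```python
def infer_sql_type(values):
    """Guess INTEGER, REAL or TEXT for a column of string values (simplified)."""
    def all_parse(caster):
        try:
            for value in values:
                caster(value)
            return True
        except ValueError:
            return False

    if all_parse(int):
        return "INTEGER"
    if all_parse(float):
        return "REAL"
    return "TEXT"

print(infer_sql_type(["7.0", "6.0"]))               # REAL
print(infer_sql_type(["1", "2", "3"]))              # INTEGER
print(infer_sql_type(["118GQ70Sp6pMqn6w1oKuki"]))   # TEXT
```

Every value has to be checked before a type can be chosen, which is the work that gets skipped on large tables.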

# Data Pipeline on the Command Line

## cron

`cron` is a time-based job scheduler. We can use it to run jobs on a recurring schedule.

### crontab

Scheduled jobs are stored in the user's `crontab` file, and `crontab -l` lists the current entries.

In [62]:
!crontab -l

crontab: no crontab for miguel.carvalho


In [64]:
!man crontab


CRONTAB(1)                BSD General Commands Manual               CRONTAB(1)

NAME
     crontab -- maintain crontab files for individual users (V3)

SYNOPSIS
     crontab [-u user] file
     crontab [-u user] { -l | -r | -e }

DESCRIPTION
     The crontab utility is the program used to install, deinstall or list the
     tables used to drive the cron(8) daemon in Vixie Cron.  Each user can
     have their own crontab, and they are not intended to be edited directly.

     (Darwin note: Although cron(8) and crontab(5) are officially supported
     under Darwin, their functionality has been absorbed into launchd(8),
     which provides a more flexible way of automatically executing commands.
     See launchctl(1) for more information.)

     If the /usr/lib/cron/cron.allow file exists, then you must be listed
     therein in order to be allowed to use this command.  If the
     /usr/lib/cron/cron.allow file does not exist but the
     /usr/lib/cron/cron.deny file does exist, then you 

#### Adding jobs to `crontab`

Two alternatives:
1. Edit the `crontab` file in a text editor such as `nano` (`crontab -e` opens it)
2. `echo` the schedule command into `crontab`

In [None]:
# Not run here because I don't want to schedule a job
# Careful: piping into crontab replaces the current crontab instead of appending to it
# !echo "* * * * * python create_model.py" | crontab

#### `crontab` syntax

Check https://crontab.guru
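
As a quick reminder of the layout, a crontab entry has five schedule fields followed by the command. The Python sketch below just splits an entry and labels the fields; it does not validate or evaluate the schedule, and `describe_entry` is a made-up helper. The second entry is a hypothetical example of a daily job.

```python
FIELDS = ["minute", "hour", "day of month", "month", "day of week"]

def describe_entry(entry):
    """Split a crontab line into its five labelled schedule fields and the command."""
    parts = entry.split(None, 5)  # five schedule fields, then the rest is the command
    schedule, command = parts[:5], parts[5]
    return dict(zip(FIELDS, schedule)), command

# `*` in every field means "every minute of every hour of every day"
schedule, command = describe_entry("* * * * * python create_model.py")
print(schedule["minute"], schedule["hour"])  # * *
print(command)                               # python create_model.py

# Hypothetical entry: run at 01:30 every day
daily, _ = describe_entry("30 1 * * * python create_model.py")
print(daily["hour"], daily["minute"])        # 1 30
```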