# Installation and documentation for csvkit

The first step in learning any library, tool, or suite of tools is to make sure we are using the latest stable version.

The second step is to make sure we know how to access the documentation, so we know where to go when we get stuck.

Let's do both in this exercise for csvkit and the various commands in this suite of data processing command-line tools.

```
# Upgrade csvkit using pip  
pip install --upgrade csvkit

# Print manual for in2csv
in2csv -h

# Print manual for csvlook
csvlook -h
```

# Converting and previewing data with csvkit

csvkit is written to process only CSV files. Therefore, the first thing we do is convert our raw data file into CSV format.

After conversion, it's good practice to take a quick peek at the contents of the file as a sanity check.

```
# Use ls to find the name of the zipped file
ls

# Use Linux's built-in unzip tool to unpack the zipped file
unzip SpotifyData.zip

# Check to confirm name and location of unzipped file
ls

# Convert SpotifyData.xlsx to csv
in2csv SpotifyData.xlsx > SpotifyData.csv

# Print a preview in console using a csvkit suite command 
csvlook SpotifyData.csv 
```

# File conversion and summary statistics with csvkit

It's common for Excel data files to have more than one worksheet (tab) of data. The Excel file `SpotifyData.xlsx` has two sheets named `Worksheet1_Popularity` and `Worksheet2_MusicAttributes`. Each sheet should be treated like its own data file, so we will use `csvkit`'s commands here to convert each sheet to its own CSV file. Then, using the commands we already know, we will print a high-level summary for each column in the CSV files.

```
# Check to confirm name and location of the Excel data file
ls

# Convert sheet "Worksheet1_Popularity" to CSV
in2csv SpotifyData.xlsx --sheet "Worksheet1_Popularity" > Spotify_Popularity.csv

# Check to confirm name and location of the new CSV file
ls

# Print high level summary statistics for each column
csvstat Spotify_Popularity.csv 
```

```
# Check to confirm name and location of the Excel data file
ls

# Convert sheet "Worksheet2_MusicAttributes" to CSV
in2csv SpotifyData.xlsx --sheet "Worksheet2_MusicAttributes" > Spotify_MusicAttributes.csv

# Check to confirm name and location of the new CSV file
ls

# Print preview of Spotify_MusicAttributes
csvlook Spotify_MusicAttributes.csv
```

# Printing column headers with csvkit

There are many ways to preview the data within `csvkit` alone (e.g. `csvlook`, `csvstat`, etc.). However, if all we want is to find the position and name of the columns in our data, it is easier to simply print a list of column headers. Let's print the column headers for the data file `Spotify_MusicAttributes.csv`.

```
# Check to confirm name and location of data file
ls

# Print a list of column headers in data file 
csvcut -n Spotify_MusicAttributes.csv
```

# Filtering data by column with csvkit

Let's get some hands-on practice filtering data by column using the `csvkit` command `csvcut`. Remember that we can filter columns by referring to the position of the column (e.g. 1st column, 2nd column) or by referring to the exact name of the column as it appears in the data file.

```
# Print a list of column headers in the data 
csvcut -n Spotify_MusicAttributes.csv

# Print the first column, by position
csvcut -c 1 Spotify_MusicAttributes.csv
```

```
# Print a list of column headers in the data 
csvcut -n Spotify_MusicAttributes.csv

# Print the first, third, and fifth column, by position
csvcut -c 1,3,5 Spotify_MusicAttributes.csv
```

```
# Print a list of column headers in the data 
csvcut -n Spotify_MusicAttributes.csv

# Print the first column, by name
csvcut -c "track_id" Spotify_MusicAttributes.csv
```

```
# Print a list of column headers in the data 
csvcut -n Spotify_MusicAttributes.csv

# Print the track id, song duration, and loudness, by name 
csvcut -c "track_id","duration_ms","loudness" Spotify_MusicAttributes.csv
```
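
To see that selecting by position and selecting by name give the same result, here is a minimal sketch on a tiny made-up file. The file `sample_attributes.csv` and its values are hypothetical, and the sketch assumes `csvcut` is on the PATH (the csvkit steps are skipped otherwise).

```shell
# Build a tiny hypothetical file with the same column names used above
printf 'track_id,duration_ms,loudness\nt1,200000,-5.1\nt2,180000,-7.3\n' > sample_attributes.csv

if command -v csvcut >/dev/null 2>&1; then
  # Select the first column by position, then by name
  csvcut -c 1 sample_attributes.csv > by_position.csv
  csvcut -c "track_id" sample_attributes.csv > by_name.csv

  # cmp exits with 0 only when the two outputs are byte-for-byte identical
  cmp by_position.csv by_name.csv && echo "position and name select the same column"
fi
```

Selecting by name is usually the safer choice in scripts, since it keeps working if the column order of the input file changes.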

# Filtering data by row with csvkit

Now it's time to get some hands-on practice filtering data by row, matching a literal value with `-m`. Whether the value is text or numeric, `csvgrep` can help us filter on it.

```
# Print a list of column headers in the data 
csvcut -n Spotify_MusicAttributes.csv

# Filter for row(s) where track_id = 118GQ70Sp6pMqn6w1oKuki
csvgrep -c "track_id" -m 118GQ70Sp6pMqn6w1oKuki Spotify_MusicAttributes.csv
```

```
# Print a list of column headers in the data 
csvcut -n Spotify_MusicAttributes.csv

# Filter for row(s) where danceability = 0.812
csvgrep -c "danceability" -m 0.812 Spotify_MusicAttributes.csv
```
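
One detail worth checking on your own data: to my understanding, `-m` looks for the given string anywhere inside the cell, so `0.812` would also match a value such as `0.8125`. The sketch below compares `-m` with an anchored regular expression passed through `-r`. The file `sample_dance.csv` and its values are hypothetical, and the csvkit steps only run if `csvgrep` is installed.

```shell
# Hypothetical data: two danceability values share the prefix 0.812
printf 'track_id,danceability\nt1,0.812\nt2,0.8125\nt3,0.5\n' > sample_dance.csv

if command -v csvgrep >/dev/null 2>&1; then
  # -m searches for the string within the cell, so t1 and t2 both match
  csvgrep -c "danceability" -m 0.812 sample_dance.csv > match_substring.csv

  # -r takes a regular expression; anchoring with ^ and $ keeps only t1
  csvgrep -c "danceability" -r '^0\.812$' sample_dance.csv > match_exact.csv

  # Each output includes the header row, so subtract one to count data rows
  echo "rows matched with -m: $(( $(wc -l < match_substring.csv) - 1 ))"
  echo "rows matched with -r: $(( $(wc -l < match_exact.csv) - 1 ))"
fi
```

For identifiers like `track_id`, where no value is contained in another, the two approaches return the same rows.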

# Stacking files with csvkit

`SpotifyData_PopularityRank6.csv` and `SpotifyData_PopularityRank7.csv` have the same file format, column order, and overall data schema. However, one file contains information for songs ranked #6, and the other contains information for songs ranked #7. Combine the two files into a single file by stacking them.

```
# Stack the two files and save results as a new file
csvstack SpotifyData_PopularityRank6.csv SpotifyData_PopularityRank7.csv > SpotifyPopularity.csv

# Preview the newly created file 
csvlook SpotifyPopularity.csv
```
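
A quick way to see why `csvstack` is used here rather than plain `cat` is to compare the two on small made-up files. The file names and values below are hypothetical, and the `csvstack` step only runs if csvkit is installed.

```shell
# Two hypothetical files with identical headers
printf 'track_id,popularity\nt1,6\nt2,6\n' > rank6_sample.csv
printf 'track_id,popularity\nt3,7\n' > rank7_sample.csv

# Plain concatenation keeps both header lines, leaving a stray header mid-file
cat rank6_sample.csv rank7_sample.csv > stacked_with_cat.csv
echo "header lines after cat: $(grep -c '^track_id,popularity' stacked_with_cat.csv)"

if command -v csvstack >/dev/null 2>&1; then
  # csvstack writes the header once, followed by the rows of both files
  csvstack rank6_sample.csv rank7_sample.csv > stacked_sample.csv
  echo "header lines after csvstack: $(grep -c '^track_id,popularity' stacked_sample.csv)"
fi
```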

# Chaining commands using operators

The more we use command-line tools, the more we find ourselves stringing commands together. Sometimes this is for convenience; other times, the output of one command is needed as the input to another. Let's get some hands-on practice by filling in the correct chain operators for the circumstances described in the instructions below.

```
# If csvlook succeeds, then run csvstat 
csvlook Spotify_Popularity.csv && csvstat Spotify_Popularity.csv

# Use the output of csvsort as input to csvlook
csvsort -c 2 Spotify_Popularity.csv | csvlook

# Take the first 15 lines of the sorted output (header included) and save to new file
csvsort -c 2 Spotify_Popularity.csv | head -n 15 > Spotify_Popularity_Top15.csv

# Preview the new file
csvlook Spotify_Popularity_Top15.csv
```
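
The behavior of these operators can be checked with standard tools alone. The sketch below uses a made-up file, `sample_popularity.csv`, and plain `head` and `tail` in place of the csvkit commands, so it runs without csvkit. It also shows a detail that is easy to miss: `head` counts the header as a line, and since `csvsort` also writes the header first, the same counting applies to the pipeline above.

```shell
# Hypothetical file: one header line plus 20 data rows
{
  echo "track_id,popularity"
  for i in $(seq 1 20); do echo "t$i,$i"; done
} > sample_popularity.csv

# && runs the second command only if the first one exits successfully
test -s sample_popularity.csv && echo "file exists and is not empty"

# A failing first command short-circuits the chain, so nothing is echoed here
test -s no_such_file.csv && echo "this line is never printed"

# > saves the output of head to a file: 15 lines, header included
head -n 15 sample_popularity.csv > first15_sample.csv

# | passes the output of tail to wc; skipping the header leaves 14 data rows
tail -n +2 first15_sample.csv | wc -l   # prints 14

# To keep the header plus 15 data rows, ask head for 16 lines
head -n 16 sample_popularity.csv > first16_sample.csv
```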

# Data processing with csvkit

Once we have assembled a dataset, we still need to process and clean the data prior to more advanced analysis such as predictive modeling. In this capstone exercise, let's make use of various commands in csvkit for some common data processing and cleaning.

The Excel file `Spotify_201809_201810.xlsx` contains two sheets (tabs), named `Spotify201809` and `Spotify201810`. First, we will split the Excel file into its individual sheets, preview the data, remove some columns, and then stack the two sheets back together to form a single CSV file, ready for further analysis.

```
# Convert the Spotify201809 tab into its own csv file 
in2csv Spotify_201809_201810.xlsx --sheet "Spotify201809" > Spotify201809.csv

# Check to confirm name and location of data file
ls

# Preview the file using a csvkit command
csvlook Spotify201809.csv

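# Create a new CSV with 2 columns: track_id and popularity
csvcut -c "track_id","popularity" Spotify201809.csv > Spotify201809_subset.csv

# Repeat the conversion and column selection for the Spotify201810 tab
in2csv Spotify_201809_201810.xlsx --sheet "Spotify201810" > Spotify201810.csv
csvcut -c "track_id","popularity" Spotify201810.csv > Spotify201810_subset.csv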

# While stacking the 2 files, create a data source column
csvstack -g "Sep2018","Oct2018" Spotify201809_subset.csv Spotify201810_subset.csv > Spotify_all_rankings.csv
```
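
To check what the `-g` option adds, the stacking step can be tried on two small made-up files. The file names and values below are hypothetical, and the sketch only runs the csvkit step if `csvstack` is installed. It also uses `-n` to name the new column `source`; without `-n`, csvkit's default name for the column is `group`.

```shell
# Hypothetical monthly subsets with the same two columns
printf 'track_id,popularity\nt1,90\nt2,85\n' > sep_sample.csv
printf 'track_id,popularity\nt1,88\n' > oct_sample.csv

if command -v csvstack >/dev/null 2>&1; then
  # -g supplies one label per input file, in order; -n names the new column
  csvstack -g "Sep2018","Oct2018" -n "source" sep_sample.csv oct_sample.csv > all_sample.csv

  # Print the result: each row now carries the label of the file it came from
  cat all_sample.csv
fi
```

The labels make it possible to tell the September rows from the October rows after stacking, even when the same `track_id` appears in both months.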