# Unix Shell

There is a lot that can be done on the Unix shell command prompt. For homework, we will do some useful manipulations of CSV files.

There is plenty of material online that will help you figure out how to do various tasks on the command line. Some example resources I found by googling:

* Paths and Wildcards: https://www.warp.dev/terminus/linux-wildcards
* General introduction to shell: https://github-pages.ucl.ac.uk/RCPSTrainingMaterials/HPCandHTCusingLegion/2_intro_to_shell.html
* Manual pages: https://www.geeksforgeeks.org/linux-man-page-entries-different-types/?ref=ml_lbp
* Chaining commands: https://www.geeksforgeeks.org/chaining-commands-in-linux/?ref=ml_lbp
* Piping: https://www.geeksforgeeks.org/piping-in-unix-or-linux/
* Using sed: https://www.geeksforgeeks.org/sed-command-linux-set-2/?ref=ml_lbp
* Various Unix commands: https://www.geeksforgeeks.org/linux-commands/?ref=lbp
* Cheat sheets:
    * https://www.stationx.net/unix-commands-cheat-sheet/
    * https://cheatography.com/davechild/cheat-sheets/linux-command-line/
    * https://www.theknowledgeacademy.com/blog/unix-commands-cheat-sheet/
    
These aren't necessarily the best resources. Feel free to search for better ones. Also, don't forget that Unix has built-in manual pages for all of its commands. Just type `man <command>` at the command prompt. Use the space-bar to scroll through the documentation and "q" to exit.

1. After unzipping the Kaggle CSV files, make a new directory for the original zip files, and move the files there. In case you accidentally mess up one of the CSV files, you'll be able to unzip the data again.

In [None]:
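# Make a directory for the original zip archives and download into it
# (no .csv suffix, so the directory is not mistaken for a CSV file)
mkdir Zip-Files
cd Zip-Files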

kaggle datasets download -d henryshan/starbucks
kaggle datasets download -d vjchoudhary7/customer-segmentation-tutorial-in-python
kaggle datasets download -d yasserh/titanic-dataset
kaggle datasets download -d rishidamarla/heart-disease-prediction
kaggle datasets download -d ananthr1/weather-prediction
kaggle datasets download -d harishkumardatalab/housing-price-prediction
kaggle datasets download -d muhammadbinimran/housing-price-prediction-data
kaggle datasets download -d rafsunahmad/world-all-university-ranking-factors
kaggle datasets download -d ayaz11/used-car-price-prediction
kaggle datasets download -d iammustafatz/diabetes-prediction-dataset
kaggle datasets download -d thedevastator/cancer-patients-and-air-pollution-a-new-link
kaggle datasets download -d imtkaggleteam/breast-cancer
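
If the archives were already downloaded next to the extracted CSV files, the "make a directory and move the zip files there" step could look roughly like the sketch below. The scratch directory and the file names are placeholders made up for the demo, not the real Kaggle file names.

```shell
# Work in a scratch directory so nothing real is touched (demo only)
workdir=$(mktemp -d)
cd "$workdir"

# Stand-ins for downloaded archives and one extracted CSV (hypothetical names)
touch titanic-dataset.zip starbucks.zip
touch Titanic-Dataset.csv

# Make the archive directory (-p: no error if it already exists)
mkdir -p zip-files

# Move every zip archive into it; the CSV stays where it is
mv -- *.zip zip-files/

# Show what ended up in the archive directory
ls zip-files
```

The `--` tells `mv` that everything after it is a file name, which protects against archives whose names start with a dash.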

2. The "diabetes_prediction_dataset.csv" file has a lot of entries. Create 3 new CSV files, each with about 1/3 of the data.

In [None]:
# These line numbers assume the file has 10,001 lines (1 header + 10,000 rows);
# check with: wc -l diabetes_prediction_dataset.csv

# Put the header line into the second and third files
# (> rather than >> so that re-running does not duplicate the header)
head -n 1 diabetes_prediction_dataset.csv > DIABETES2.csv
head -n 1 diabetes_prediction_dataset.csv > DIABETES3.csv

# Put the first 3335 lines (header + 3334 rows) into the first diabetes file
head -n 3335 diabetes_prediction_dataset.csv > DIABETES1.csv

# Append lines 3336-6668 to the second diabetes file
head -n 6668 diabetes_prediction_dataset.csv | tail -n +3336 >> DIABETES2.csv

# Append lines 6669-10,001 to the third diabetes file
head -n 10001 diabetes_prediction_dataset.csv | tail -n +6669 >> DIABETES3.csv
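
Hard-coded line numbers only work if the file really has the assumed length. As a rough sketch, the split points can instead be computed from `wc -l`. The file `sample.csv` and the `part1.csv` to `part3.csv` names below are made up for the demo.

```shell
# Work in a scratch directory (demo only)
workdir=$(mktemp -d)
cd "$workdir"

# Sample file: a header plus 10 data rows (hypothetical data)
{ echo "id,value"; seq 1 10 | sed 's/$/,x/'; } > sample.csv

# Number of data rows, excluding the header
total=$(( $(wc -l < sample.csv) - 1 ))

# Rows per output file, rounded up so that no row is lost
chunk=$(( (total + 2) / 3 ))

for i in 1 2 3; do
    # First line of this chunk: skip the header and the earlier chunks
    start=$(( (i - 1) * chunk + 2 ))
    # Every part starts with the header line
    head -n 1 sample.csv > "part$i.csv"
    # Append up to $chunk rows starting at line $start
    tail -n +"$start" sample.csv | head -n "$chunk" >> "part$i.csv"
done

wc -l part1.csv part2.csv part3.csv
```

With 10 data rows this gives parts of 4, 4, and 2 rows, each with its own header line.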

3. Create 2 new CSV files from Heart_Disease_Prediction.csv, one containing the rows with the "Presence" label and the other containing the rows with the "Absence" label. Make sure that the first line of each file contains the field names.

In [None]:
# Create 2 files containing the header line of the original CSV
head -n 1 Heart_Disease_Prediction.csv > heart-presence-file.csv
head -n 1 Heart_Disease_Prediction.csv > heart-absence-file.csv

# Select the lines containing "Presence" and append them to the presence file
grep "Presence" Heart_Disease_Prediction.csv >> heart-presence-file.csv

# Select the lines containing "Absence" and append them to the absence file
grep "Absence" Heart_Disease_Prediction.csv >> heart-absence-file.csv
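
`grep` matches the word anywhere on the line, so it would also pick up a row that mentions "Presence" in some other column. A stricter variant, sketched here with `awk` (a swapped-in tool, not what the cell above uses), compares only the last field. The sample file and its "Notes" column are invented for the demo, and the sketch assumes the label is the last column.

```shell
# Work in a scratch directory (demo only)
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical sample: the first row mentions "Presence" in a non-label column
cat > sample.csv <<'EOF'
Age,Notes,Heart Disease
70,Presence of murmur,Absence
67,none,Presence
57,none,Absence
EOF

# Keep the header (NR == 1) plus the rows whose last field is exactly the label
awk -F',' 'NR == 1 || $NF == "Presence"' sample.csv > presence.csv
awk -F',' 'NR == 1 || $NF == "Absence"'  sample.csv > absence.csv

cat presence.csv
```

Here `$NF` is the last comma-separated field, so the "Presence of murmur" row lands in the absence file where it belongs.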

4. What fraction of cars in car_web_scraped_dataset.csv have had no accidents?

In [None]:
# Count the lines that contain "No accidents" (wc -l counts lines, not words)
grep "No accidents" car_web_scraped_dataset.csv | wc -l
# -> 2223

# Count all lines in the file; this total includes the header line, if the file has one
wc -l car_web_scraped_dataset.csv
# ->  2841

In [2]:
2223/2841

0.7824709609292503
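
Both counts and the division can also be done in one pass. The sketch below uses `awk` (a swapped-in tool) and skips the header line so that it is not counted as a car. The file `cars.csv` and its contents are made up for the demo.

```shell
# Work in a scratch directory (demo only)
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical sample with a header and 4 cars
cat > cars.csv <<'EOF'
name,condition
A,"No accidents reported, 1 Owner"
B,"1 accident reported, 2 Owners"
C,"No accidents reported, 3 Owners"
D,"No accidents reported, 1 Owner"
EOF

# NR > 1 skips the header; count all rows and the rows matching "No accidents"
awk 'NR > 1 { total++; if (/No accidents/) clean++ }
     END { printf "%d / %d = %.4f\n", clean, total, clean / total }' cars.csv > fraction.txt

cat fraction.txt
# -> 3 / 4 = 0.7500
```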

5. Make the following replacements in Housing.csv and output the result into a new CSV:
- yes -> 1
- no -> 0
- unfurnished -> 0
- furnished -> 1
- semi-furnished -> 2

In [None]:
# All in one line, make the replacements
sed 's/yes/1/g; s/no/0/g; s/unfurnished/0/g; s/,furnished/,1/g; s/semi-furnished/2/g' Housing.csv > New-Housing.csv

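- `s/pattern/replacement/` tells sed to substitute the pattern ("yes", "no", "unfurnished", etc.) with the replacement; by itself it replaces only the first match on each line
- `/g` at the end stands for "global" and makes sure all occurrences of the pattern on EACH LINE are replaced, not just the first
- `,furnished` includes the leading comma so that it matches only the standalone value "furnished" and not the tail of "semi-furnished"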
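
Because "furnished" is a substring of the other two values, the order of the substitutions matters. One way to avoid the leading-comma trick is to replace the longest patterns first, as in this sketch. The three-column `housing.csv` is a made-up miniature, not the real Housing.csv.

```shell
# Work in a scratch directory (demo only)
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical miniature of the housing data
cat > housing.csv <<'EOF'
mainroad,guestroom,furnishingstatus
yes,no,furnished
no,no,semi-furnished
yes,yes,unfurnished
EOF

# Longest patterns first, so "furnished" only matches what is left over
sed -e 's/semi-furnished/2/g' \
    -e 's/unfurnished/0/g' \
    -e 's/furnished/1/g' \
    -e 's/yes/1/g' \
    -e 's/no/0/g' housing.csv > housing-coded.csv

cat housing-coded.csv
```

Note that plain patterns like `no` would also rewrite a header that happens to contain those letters, so it is worth checking the first line of the output.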

6. Create a new CSV file from Mall_Customers.csv, removing the "CustomerID" column.

In [None]:
cut -d ',' -f 2- Mall_Customers.csv > New_Mall_Customers.csv

- `cut -d ','` makes the separator a comma
- `-f 2-` says we want to keep columns starting from the second column onwards ⇒ removing "CustomerID"
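
The same idea works for a column in the middle: list the fields to keep instead of a range. A small sketch on a made-up `customers.csv` follows. Keep in mind that `cut` splits on every comma, so it is only safe when no field contains a quoted comma.

```shell
# Work in a scratch directory (demo only)
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical miniature of the customer data
cat > customers.csv <<'EOF'
CustomerID,Gender,Age
1,Male,19
2,Female,21
EOF

# Drop the first column: keep field 2 and everything after it
cut -d ',' -f 2- customers.csv > no-id.csv

# Drop a middle column: list the fields to keep explicitly
cut -d ',' -f 1,3 customers.csv > no-gender.csv

cat no-id.csv no-gender.csv
```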

7. From `world all university rank and rank score.csv`, create a new file that contains the sum of the following fields for each row:
- Research Quality Score
- Industry Score
- International Outlook
- Research Environment Score

In [None]:
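# First, install bc (Debian/Ubuntu)
sudo apt install bc

# Take columns 5-8, drop the header row (tail -n +2), remove Windows line
# endings if there are any (tr -d '\r'), turn the commas into plus signs,
# and let bc evaluate each line
cut -f 5,6,7,8 -d ',' 'world all university rank and rank score.csv' | tail -n +2 | tr -d '\r' | tr -s ',' '+' | bc > Summed-Rank-Scores.csv

# Without `tail -n +2` the header text is also sent to bc, which is a likely
# cause of the 'illegal character' and 'syntax error' messages I kept getting.
# Escaping the spaces (world\ all\ university\ rank\ and\ rank\ score.csv)
# instead of quoting the file name is equivalent, so it gave the same errors.
# If errors persist, look for rows with missing values or quoted commas.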

- `cut -d ','` makes the separator a comma
- `-f 5,6,7,8` selects the 5th-8th columns in each row (the Research Quality Score, Industry Score, International Outlook, and Research Environment Score columns)
- `tr -s ',' '+'` replaces the commas between the 5th-8th columns with plus signs for summing
- `bc` evaluates each line as an arithmetic expression and prints the sum
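
As an alternative that needs no extra install, `awk` can add the fields directly and skip the header by itself. This is a sketch with a swapped-in tool, not the `bc` pipeline above. The `scores.csv` file, its column names, and the `score_sum` header are invented for the demo, and it assumes the four scores sit in columns 5-8 with no quoted commas earlier in the row.

```shell
# Work in a scratch directory (demo only)
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical miniature with the four score columns in positions 5-8
cat > scores.csv <<'EOF'
rank,name,location,overall,research_quality,industry,international,research_env
1,Alpha,X,90,10.5,20,30,40
2,Beta,Y,80,1,2,3,4.25
EOF

# Line 1: write a header for the output; other lines: print the sum of fields 5-8
awk -F',' 'NR == 1 { print "score_sum"; next }
           { print $5 + $6 + $7 + $8 }' scores.csv > sums.csv

cat sums.csv
```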

8. Sort the "cancer patient data sets.csv" file by age. Make sure the output is a readable CSV file.

In [None]:
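# Keep the header row on top, then sort the remaining rows numerically by the 3rd column only (-k 3,3)
(head -n 1 'cancer patient data sets.csv'; tail -n +2 'cancer patient data sets.csv' | sort -t ',' -k 3,3 -n) > sorted_cancer_patient_data.csv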
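
To see why the header needs separate handling and why the key is limited to one column, here is the same pattern on a tiny made-up file. The sketch assumes Age is the 3rd column, and `patients.csv` is a placeholder name.

```shell
# Work in a scratch directory (demo only)
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical miniature of the patient data
cat > patients.csv <<'EOF'
index,Patient Id,Age
0,P1,33
1,P10,17
2,P100,9
3,P2,44
EOF

# Header first, then the data rows sorted numerically on field 3 only
{ head -n 1 patients.csv; tail -n +2 patients.csv | sort -t ',' -k 3,3n; } > sorted.csv

cat sorted.csv
```

The `n` suffix makes the comparison numeric, so 9 sorts before 17 instead of after it as it would in a plain text sort.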