# Unix Shell

A lot can be done at the Unix shell command prompt. For this homework, we will perform some useful manipulations of CSV files.

There is plenty of material online that will help you figure out how to do various tasks on the command line. Some example resources I found by googling:

* Paths and Wildcards: https://www.warp.dev/terminus/linux-wildcards
* General introduction to shell: https://github-pages.ucl.ac.uk/RCPSTrainingMaterials/HPCandHTCusingLegion/2_intro_to_shell.html
* Manual pages: https://www.geeksforgeeks.org/linux-man-page-entries-different-types/?ref=ml_lbp
* Chaining commands: https://www.geeksforgeeks.org/chaining-commands-in-linux/?ref=ml_lbp
* Piping: https://www.geeksforgeeks.org/piping-in-unix-or-linux/
* Using sed: https://www.geeksforgeeks.org/sed-command-linux-set-2/?ref=ml_lbp
* Various Unix commands: https://www.geeksforgeeks.org/linux-commands/?ref=lbp
* Cheat sheets:
    * https://www.stationx.net/unix-commands-cheat-sheet/
    * https://cheatography.com/davechild/cheat-sheets/linux-command-line/
    * https://www.theknowledgeacademy.com/blog/unix-commands-cheat-sheet/
    
These aren't necessarily the best resources. Feel free to search for better ones. Also, don't forget that Unix has built-in manual pages for all of its commands. Just type `man <command>` at the command prompt. Use the space bar to scroll through the documentation and "q" to exit.

## Homework

Perform all of these tasks on the Unix command prompt. Some may require several commands. Many will require chaining commands together. Once you figure out how to perform a task, copy and paste the command(s) here.

1. After unzipping the Kaggle CSV files, make a new directory for the original zip files, and move the files there. In case you accidentally mess up one of the CSV files, you'll be able to unzip the data again.

Hint: use `mkdir` and `mv` commands with appropriate wildcards.
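
The hint above can be sketched on throwaway files. This is only an illustration: the directory name `zips` and the `sample-*` file names are made up, and everything happens in a scratch directory created by `mktemp`.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir"

# Stand-ins for downloaded archives and one extracted CSV (hypothetical names).
touch sample-a.zip sample-b.zip sample-a.csv

# Create the target directory; -p avoids an error if it already exists.
mkdir -p zips

# *.zip expands to every file ending in .zip, so the CSV stays where it is.
mv -- *.zip zips/

ls zips
```

The `--` tells `mv` that everything after it is a file name, which protects against names that start with a dash.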

2. The "diabetes_prediction_dataset.csv" file has a lot of entries. Create 3 new CSV files, each with about 1/3 of the data.

Hints: 
* Use `head` to get the first line.
* First create 3 files with just the first line by redirecting the output of `head` into a file using `>`.
* Use `wc` to count the number of entries.
* Chain/pipe `head` and `tail` to select specific lines, redirecting the output to append to the 3 files you created using `>>`.
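
One way to follow these hints is to compute the chunk size from the line count and derive every boundary from it, so that no row can land in two files. The sketch below runs on a toy file with 10 data rows; `data.csv` and the `part*.csv` names are assumptions for illustration only.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Toy input: a header plus 10 data rows (hypothetical file).
{ echo "id,value"; seq 1 10 | sed 's/$/,x/'; } > data.csv

total=$(wc -l < data.csv)      # 11 lines, header included
rows=$((total - 1))            # 10 data rows
chunk=$(( (rows + 2) / 3 ))    # rows per file, rounded up: 4

# Start each output file with the header line.
for i in 1 2 3; do
  head -n 1 data.csv > "part$i.csv"
done

# tail -n +K starts printing at line K; head -n N keeps the next N lines.
tail -n +2                  data.csv | head -n "$chunk" >> part1.csv
tail -n +$((2 + chunk))     data.csv | head -n "$chunk" >> part2.csv
tail -n +$((2 + 2 * chunk)) data.csv                    >> part3.csv

wc -l part1.csv part2.csv part3.csv
```

Because each starting line is the previous start plus `chunk`, the three ranges are adjacent and cover every data row exactly once; the last file simply takes whatever remains.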

3. Create 2 new CSV files from `Heart_Disease_Prediction.csv`, one containing the rows with the "Presence" label and another containing the rows with the "Absence" label. Make sure that the first line of each file contains the field names.

Hints: 
* Use `head` to get the first line.
* First create 2 files with just the first line by redirecting the output of `head` into a file using `>`.
* Use `grep` to select lines that contain "Absence" or "Presence" and append the output to the appropriate file created in the previous step.
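
A minimal sketch of this header-then-`grep` pattern on toy data. It assumes the label sits in the last column and that the file uses Unix line endings; the file names and the column layout here are invented for the example.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Toy data with the label in the last column (hypothetical columns).
cat > heart.csv <<'EOF'
Age,Cholesterol,Heart Disease
70,322,Presence
67,564,Absence
57,261,Presence
EOF

# Write the header first, then append only the matching rows.
head -n 1 heart.csv > presence.csv
head -n 1 heart.csv > absence.csv

# The leading ',' and trailing '$' restrict the match to the final field.
grep ',Presence$' heart.csv >> presence.csv
grep ',Absence$'  heart.csv >> absence.csv

cat presence.csv
```

If the file had Windows line endings, the `$` anchor would not match because of the trailing carriage return; dropping the anchor is a simple fallback in that case.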

4. What fraction of cars in `car_web_scraped_dataset.csv` have had no accidents?

Hints:
* Use `grep` to select the appropriate lines.
* Pipe the output of grep into `wc` (using `|`) to count the lines.
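
As an illustration of counting with `grep` and `wc` and then turning the counts into a fraction, here is a sketch on toy data. The wording of the accident field is made up, and `awk` is used only as a calculator for the final division; that last step is an addition to the hints, not part of them.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Toy data; the accident wording is invented for illustration.
cat > cars.csv <<'EOF'
name,price,condition
Car A,10000,No accidents reported
Car B,12000,1 accident reported
Car C,9000,No accidents reported
Car D,15000,2 accidents reported
EOF

# Count the matching rows, and all data rows (skipping the header).
clean=$(grep 'No accidents' cars.csv | wc -l)
total=$(tail -n +2 cars.csv | wc -l)

# The shell only does integer arithmetic, so let awk do the division.
fraction=$(awk -v c="$clean" -v t="$total" 'BEGIN { printf "%.3f", c / t }')
echo "$clean of $total rows: $fraction"
```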

5. Make the following replacements in `Housing.csv` and write the result to a new CSV file:

* yes -> 1
* no -> 0
* unfurnished -> 0
* furnished -> 1
* semi-furnished -> 2
    
Hints:
* Use `sed` to do the replacement.
* Use pipes to chain multiple `sed` commands.
* To avoid replacing "unfurnished" or "semi-furnished" when performing the "furnished" replacement, try replacing ",furnished" with ",1".
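
The chained replacements could look like the sketch below, shown on a toy file whose columns are assumptions. Note that a leading comma only anchors the start of a field, so a pattern like `,no` would also match the beginning of a field such as `,north`; the toy data has no such fields.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Toy data (hypothetical columns).
cat > housing.csv <<'EOF'
price,mainroad,guestroom,furnishingstatus
13300000,yes,no,furnished
12250000,yes,yes,semi-furnished
4900000,no,no,unfurnished
EOF

# The leading comma keeps ',furnished' from matching inside
# 'semi-furnished' or 'unfurnished'. The g flag replaces every
# occurrence on a line, since yes/no appear in several columns.
sed 's/,semi-furnished/,2/' housing.csv \
  | sed 's/,unfurnished/,0/' \
  | sed 's/,furnished/,1/' \
  | sed 's/,yes/,1/g' \
  | sed 's/,no/,0/g' \
  > housing_numeric.csv

cat housing_numeric.csv
```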

6. Create a new CSV file from `Mall_Customers`, removing the "CustomerID" column.

Hints:
* Use the `cut` command.
* The default field separator for `cut` is the tab character. For CSV, you have to use the option `-d ','`.
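
A short sketch of dropping a column with `cut`, assuming the column to remove is the first one. The file names and columns are invented for the example.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Toy data with the ID in the first column (hypothetical layout).
cat > customers.csv <<'EOF'
CustomerID,Gender,Age
1,Male,19
2,Female,21
EOF

# -d ',' sets the field delimiter; -f 2- keeps field 2 through the last field.
cut -d ',' -f 2- customers.csv > customers_no_id.csv

cat customers_no_id.csv
```

`cut` splits on every comma, so this approach only works when no field contains a quoted comma.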

7. Create a new file that contains the sum of the following fields for each row:
    * Research Quality Score
    * Industry Score
    * International Outlook
    * Research Environment Score
    
Hints:
* Use `cut` to select the correct columns.
* Use `tr` to replace ',' with '+'.
* Pipe output into `bc` to compute the sum.
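
The `cut`, `tr`, `bc` pipeline from the hints can be tried on a toy file first. The column positions and names below are assumptions, and the header is skipped with `tail` because `bc` cannot evaluate column names.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Toy data (hypothetical columns and values).
cat > rankings.csv <<'EOF'
Name,Teaching,Research Quality,Industry
Univ A,90.5,80,70.25
Univ B,60,50.5,40
EOF

# Skip the header, keep fields 2-4, turn "a,b,c" into "a+b+c",
# and let bc evaluate one expression per line.
tail -n +2 rankings.csv | cut -d ',' -f 2-4 | tr ',' '+' | bc > sums.txt

cat sums.txt
```

For non-adjacent columns, `cut -f` also accepts a comma-separated list such as `-f 2,4,7`.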

8. Sort the "cancer patient data sets.csv" file by age. Make sure the output is a readable CSV file.

Hints:
* Use `sort` with the `-n`, `-t`, and `-k` options.
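
A sketch of a numeric sort on one CSV field that keeps the header on the first line, so the result is still a readable CSV. It assumes the age is in field 2; the file names and columns are invented for the example.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Toy data with the age in the second column (hypothetical layout).
cat > patients.csv <<'EOF'
Patient Id,Age,Gender
P1,44,1
P2,17,2
P3,35,1
EOF

# Keep the header on top, then sort only the data rows.
head -n 1 patients.csv > patients_by_age.csv

# -t ',' sets the separator, -k 2,2 limits the key to field 2,
# and -n compares the key numerically instead of as text.
tail -n +2 patients.csv | sort -t ',' -k 2,2 -n >> patients_by_age.csv

cat patients_by_age.csv
```

Sorting the whole file in one step would move the header into the data, since `sort` treats it like any other line.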

## 1
#### After unzipping the Kaggle CSV files, make a new directory for the original zip files, and move the files there. In case you accidentally mess up one of the CSV files, you'll be able to unzip the data again.

#### Hint: use `mkdir` and `mv` commands with appropriate wildcards.

`mkdir datasets`<br>
`mv diabetes-prediction-dataset.zip datasets`<br>
`ls` <br><br>
 
DATA.4380.Spring.2024  DATA4380_Spring24  KidneyStoneAnalysis  datasets  diabetes_prediction_dataset.csv<br><br>

## 2
#### The "diabetes_prediction_dataset.csv" file has a lot of entries. Create 3 new CSV files, each with about 1/3 of the data.

#### Hints: 
* Use `head` to get the first line.
* First create 3 files with just the first line by redirecting the output of `head` into a file using `>`.
* Use `wc` to count the number of entries.
* Chain/pipe `head` and `tail` to select specific lines, redirecting the output to append to the 3 files you created using `>>`.<br><br><br>

`head diabetes_prediction_dataset.csv`<br><br>
 
gender,age,hypertension,heart_disease,smoking_history,bmi,HbA1c_level,blood_glucose_level,diabetes<br>
Female,80.0,0,1,never,25.19,6.6,140,0<br>
Female,54.0,0,0,No Info,27.32,6.6,80,0<br>
Male,28.0,0,0,never,27.32,5.7,158,0<br>
Female,36.0,0,0,current,23.45,5.0,155,0<br>
Male,76.0,1,1,current,20.14,4.8,155,0<br>
Female,20.0,0,0,never,27.32,6.6,85,0<br>
Female,44.0,0,0,never,19.31,6.5,200,1<br>
Female,79.0,0,0,No Info,23.86,5.7,85,0<br>
Male,42.0,0,0,never,33.64,4.8,145,0<br><br>

`wc -l < diabetes_prediction_dataset.csv`<br><br>

100001<br><br>

`head -n 1 diabetes_prediction_dataset.csv > diabetes_predict1.csv`<br>
`head -n 1 diabetes_prediction_dataset.csv > diabetes_predict2.csv`<br>
`head -n 1 diabetes_prediction_dataset.csv > diabetes_predict3.csv`<br><br>

`sed -n '2,33333p' diabetes_prediction_dataset.csv >> diabetes_predict1.csv`<br>
`sed -n '33334,66666p' diabetes_prediction_dataset.csv >> diabetes_predict2.csv`<br>
`sed -n '66666,100001p' diabetes_prediction_dataset.csv >> diabetes_predict3.csv`<br>
`wc -l < diabetes_predict1.csv`<br>
<br><br>
33334<br><br>

`head diabetes_predict1.csv`<br><br>

gender,age,hypertension,heart_disease,smoking_history,bmi,HbA1c_level,blood_glucose_level,diabetes<br>
Female,80.0,0,1,never,25.19,6.6,140,0<br>
Female,54.0,0,0,No Info,27.32,6.6,80,0<br>
Male,28.0,0,0,never,27.32,5.7,158,0<br>
Female,36.0,0,0,current,23.45,5.0,155,0<br>
Male,76.0,1,1,current,20.14,4.8,155,0<br>
Female,20.0,0,0,never,27.32,6.6,85,0<br>
Female,44.0,0,0,never,19.31,6.5,200,1<br>
Female,79.0,0,0,No Info,23.86,5.7,85,0<br>
Male,42.0,0,0,never,33.64,4.8,145,0<br>
<br>
`wc -l < diabetes_predict2.csv`<br><br>

33334<br><br>

`head diabetes_predict2.csv`<br><br>

gender,age,hypertension,heart_disease,smoking_history,bmi,HbA1c_level,blood_glucose_level,diabetes<br>
Male,80.0,0,0,ever,30.23,4.8,85,0<br>
Female,12.0,0,0,No Info,27.32,6.0,200,0<br>
Male,37.0,0,0,never,27.9,3.5,160,0<br>
Female,28.0,0,0,never,25.08,3.5,80,0<br>
Female,13.0,0,0,No Info,27.45,4.5,130,0<br>
Female,4.0,0,0,No Info,19.49,6.5,100,0<br>
Male,35.0,0,0,No Info,27.32,6.6,140,0<br>
Male,10.0,0,0,No Info,24.78,5.7,130,0<br>
Female,39.0,0,0,current,23.41,5.7,100,0<br><br>

`wc -l < diabetes_predict3.csv`<br><br>

33337<br><br>

`head diabetes_predict3.csv`<br><br>

gender,age,hypertension,heart_disease,smoking_history,bmi,HbA1c_level,blood_glucose_level,diabetes<br>
Female,53.0,0,0,never,50.88,8.2,200,1<br>
Female,33.0,0,0,never,20.73,6.6,158,0<br>
Female,63.0,0,0,No Info,27.32,6.2,200,0<br>
Female,40.0,0,0,never,41.97,6.2,100,0<br>
Female,80.0,1,0,never,27.32,6.5,220,1<br>
Male,54.0,0,0,never,39.85,6.5,130,1<br>
Male,38.0,0,0,never,27.32,6.6,145,0<br>
Male,54.0,0,0,No Info,32.14,6.2,140,0<br>
Female,80.0,0,0,No Info,20.23,5.7,80,0<br>

