# Unix Shell

There is a lot that can be done on the Unix shell command prompt. For homework, we will do some useful manipulations of CSV files.

There is plenty of material online that will help you figure out how to do various tasks on the command line. Some example resources I found by googling:

* Paths and Wildcards: https://www.warp.dev/terminus/linux-wildcards
* General introduction to shell: https://github-pages.ucl.ac.uk/RCPSTrainingMaterials/HPCandHTCusingLegion/2_intro_to_shell.html
* Manual pages: https://www.geeksforgeeks.org/linux-man-page-entries-different-types/?ref=ml_lbp
* Chaining commands: https://www.geeksforgeeks.org/chaining-commands-in-linux/?ref=ml_lbp
* Piping: https://www.geeksforgeeks.org/piping-in-unix-or-linux/
* Using sed: https://www.geeksforgeeks.org/sed-command-linux-set-2/?ref=ml_lbp
* Various Unix commands: https://www.geeksforgeeks.org/linux-commands/?ref=lbp
* Cheat sheets:
    * https://www.stationx.net/unix-commands-cheat-sheet/
    * https://cheatography.com/davechild/cheat-sheets/linux-command-line/
    * https://www.theknowledgeacademy.com/blog/unix-commands-cheat-sheet/
    
These aren't necessarily the best resources. Feel free to search for better ones. Also, don't forget that Unix has built-in manual pages for all of its commands. Just type `man <command>` at the command prompt. Use the space-bar to scroll through the documentation and "q" to exit.

## Homework (Due Friday 6/13)

### Setup 

1. Make sure you have set up:
    * A laptop with Python, Jupyter, and the usual Data Science stack installed via pip.
    * Note: if you are using Windows, you must set up via WSL.
        * You must know how to read files on Windows disks from the WSL Ubuntu VM.
        * You must know how to read files in your WSL Ubuntu VM from Windows.

2. Make sure you are set up to use GitHub on the command line in your Linux / MacOS environment:
    * Make sure GitHub is properly set up:
        * Authentication
        * Demonstrate that you can push from the command prompt.
    * Create a new repository for your work. Name it appropriately. Make it public.
    * Organize it. You will do various projects, so create, for example, a sub-directory for each project.

3. Install the Kaggle API:
    * Install the [Kaggle API](https://www.kaggle.com/docs/api).
        * Set up your PATH environment variable.
        * Be able to edit text files.

4. Create a directory in your Linux / MacOS filesystem where you will store the files. We will be using CSV files from the Kaggle challenges and datasets listed here. Download and unzip all of the datasets **using the Kaggle API**.
    * https://www.kaggle.com/competitions/diabetes-prediction-with-nn
    * https://www.kaggle.com/datasets/rishidamarla/heart-disease-prediction
    * https://www.kaggle.com/competitions/used-car-price-prediction-competition
    * https://www.kaggle.com/datasets/yasserh/housing-prices-dataset
    * https://www.kaggle.com/datasets/simtoor/mall-customers
    * https://www.kaggle.com/competitions/business-research-methods-world-university-rankings
    * https://www.kaggle.com/datasets/m5anas/cancer-patient-data-sets
        
5. Set up for homework submission.
    * Create a homework directory in your GitHub repo where you will submit your solutions.
    * Create a symbolic link from this directory to where you stored the downloaded datasets in step 4 above.
    * **Do not commit any datafiles into GitHub.**
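
The symbolic link in step 5 can be sketched as follows. This is only an illustration in a throwaway sandbox: the directory names `HW1_Datasets`, `my-repo/homework1`, and the link name `data` are assumptions, so substitute your own paths.

```shell
# Hypothetical layout: a temporary sandbox stands in for your real directories.
sandbox=$(mktemp -d)
mkdir -p "$sandbox/HW1_Datasets" "$sandbox/my-repo/homework1"
echo "a,b" > "$sandbox/HW1_Datasets/example.csv"

# ln -s TARGET LINK_NAME: the link lives in the repo, the data stays outside it.
ln -s "$sandbox/HW1_Datasets" "$sandbox/my-repo/homework1/data"

# Ignore the link so that no data files are committed by accident.
echo "data" >> "$sandbox/my-repo/homework1/.gitignore"

ls -l "$sandbox/my-repo/homework1"
```

The `.gitignore` entry is written without a trailing slash because Git treats a symbolic link as a file rather than a directory.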
        
### Exercises        

Perform all of these tasks on the Unix command prompt. Some may require several commands. Many will require chaining commands together. Once you figure out how to perform a task, copy and paste the command(s) into a notebook that you will submit.


#### 1. Organize your dataset directory. Make a new directory for the original zip files, and move the files there. In case you accidentally mess up one of the CSV files, you'll be able to unzip the data again.

Hint: use `mkdir` and `mv` commands with appropriate wildcards.




    
    mkdir given_zips
    mv *.zip given_zips

In [12]:
!cd HW1_Datasets && ls

Heart_Disease_Prediction.csv
Housing.csv
Mall Customers.xlsx
cancer patient data sets.csv
car_web_scraped_dataset.csv
diabetes_prediction_dataset.csv
given_zips
world all university rank and rank score.csv


In [20]:
!cd HW1_Datasets/given_zips && ls

breast-cancer.zip
cancer-patients-and-air-pollution-a-new-link.zip
customer-segmentation-tutorial-in-python.zip
heart-disease-prediction.zip
housing-price-prediction.zip
starbucks.zip
titanic-dataset.zip
weather-prediction.zip


#### 2. The "diabetes_prediction_dataset.csv" file has a lot of entries. Create 3 new CSV files, each with about 1/3 of the data.

Hints: 
* Use `head` to get first line.  
* First create 3 files with just the first line by redirecting output of `head` into a file using `>`.
* Use `wc` to count the number of entries
* Chain/pipe `head` and `tail` to select specific lines, redirecting output to append to the 3 files you created using `>>`.
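
One way the hints above might fit together, shown on a small stand-in file rather than the real dataset. The names `sample.csv` and `part_*.csv` are hypothetical, and the chunk size is computed with shell arithmetic rather than typed in by hand.

```shell
# Scratch directory with a stand-in file: 1 header line + 9 data rows.
cd "$(mktemp -d)"
{ echo "id,value"; seq 1 9 | sed 's/$/,x/'; } > sample.csv

rows=$(( $(wc -l < sample.csv) - 1 ))   # data rows, header excluded
chunk=$(( (rows + 2) / 3 ))             # rows / 3, rounded up

# Start every part with the header line.
for i in 1 2 3; do head -n 1 sample.csv > "part_$i.csv"; done

# tail -n +K starts printing at line K; head then caps the slice length.
tail -n +2 sample.csv | head -n "$chunk" >> part_1.csv
tail -n +$(( chunk + 2 )) sample.csv | head -n "$chunk" >> part_2.csv
tail -n +$(( 2 * chunk + 2 )) sample.csv >> part_3.csv

wc -l part_*.csv
```

The middle slice is the easy one to get wrong: without the trailing `head -n "$chunk"`, `tail` would copy everything from the start of the second third to the end of the file.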




In [27]:
!cd HW1_Datasets && wc diabetes_prediction_dataset.csv

  100001  142264 3810356 diabetes_prediction_dataset.csv


The file has 100,001 lines; subtracting one for the header leaves **100,000 data rows**.

In [118]:
    ## ^^ Help with formatting "middle" csv pipeline and debugging (original code was creating a csv with 100000+ rows)

#### 3. Create 2 new CSV files from `Heart_Disease_Prediction.csv`, one containing rows with "Presence" label and another with "Absence" label. Make sure that the first line of each file contains the field names. 

Hints: 
* Use `head` to get first line.  
* First create 2 files with just the first line by redirecting output of `head` into a file using `>`.
* Use `grep` to select lines that contain "Absence" or "Presence" and append the output to the appropriate file created in the previous step.
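
A minimal sketch of these hints on a made-up four-line file. The file names and sample rows are hypothetical, and the pattern assumes the label is the last field on the line and that the file uses Unix line endings.

```shell
cd "$(mktemp -d)"
cat > heart_sample.csv <<'EOF'
Age,Sex,BP,Heart Disease
70,1,130,Presence
67,0,115,Absence
57,1,124,Presence
EOF

# Each output file starts with the header line.
head -n 1 heart_sample.csv > heart_presence.csv
head -n 1 heart_sample.csv > heart_absence.csv

# ',Label$' matches only rows whose final field is the label.
grep ',Presence$' heart_sample.csv >> heart_presence.csv
grep ',Absence$' heart_sample.csv >> heart_absence.csv

head heart_presence.csv heart_absence.csv
```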




In [42]:
!cd HW1_Datasets && head heart_disease_presence.csv && echo " " && head heart_disease_absence.csv

Age,Sex,Chest pain type,BP,Cholesterol,FBS over 120,EKG results,Max HR,Exercise angina,ST depression,Slope of ST,Number of vessels fluro,Thallium,Heart Disease
70,1,4,130,322,0,2,109,0,2.4,2,3,3,Presence
57,1,2,124,261,0,0,141,0,0.3,1,0,7,Presence
56,1,3,130,256,1,2,142,1,0.6,2,1,6,Presence
59,1,4,110,239,0,2,142,1,1.2,2,1,7,Presence
60,1,4,140,293,0,2,170,0,1.2,2,2,7,Presence
63,0,4,150,407,0,2,154,0,4,2,3,7,Presence
61,1,1,134,234,0,0,145,0,2.6,2,2,3,Presence
46,1,4,140,311,0,0,120,1,1.8,2,2,7,Presence
53,1,4,140,203,1,2,155,1,3.1,3,0,7,Presence
 
Age,Sex,Chest pain type,BP,Cholesterol,FBS over 120,EKG results,Max HR,Exercise angina,ST depression,Slope of ST,Number of vessels fluro,Thallium,Heart Disease
67,0,3,115,564,0,2,160,0,1.6,2,0,7,Absence
64,1,4,128,263,0,0,105,1,0.2,2,1,7,Absence
74,0,2,120,269,0,2,121,1,0.2,1,1,3,Absence
65,1,4,120,177,0,0,140,0,0.4,1,0,7,Absence
59,1,4,135,234,0,0,161,0,0.5,2,0,7,Absence
53,1,4,142,226,0,2,111,1,0,1,0,7,Absence
44,1,3,140,235,0,2,180,0,0,1

#### 4. What fraction of cars in `car_web_scraped_dataset.csv` have had no accidents?

Hints:
* Use `grep` to select the appropriate lines.
* Pipe the output of grep into `wc` (using `|`) to count the lines.
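
The counting step could look roughly like this. The sample rows and the phrase "No accidents reported" are assumptions about how the condition column is worded, so check the real file with `head` first. The division uses `awk`, which is one of several ways to get a decimal result in the shell.

```shell
cd "$(mktemp -d)"
cat > cars_sample.csv <<'EOF'
name,year,price,condition
Car A,2019,15000,"No accidents reported, 1 Owner"
Car B,2017,9000,"1 accident reported, 2 Owners"
Car C,2020,21000,"No accidents reported, 2 Owners"
Car D,2015,7000,"2 accidents reported, 3 Owners"
EOF

# Count data rows (header skipped) and rows mentioning no accidents.
total=$(tail -n +2 cars_sample.csv | wc -l)
clean=$(grep -i "no accidents" cars_sample.csv | wc -l)

# Shell arithmetic is integer-only, so let awk do the division.
fraction=$(awk -v c="$clean" -v t="$total" 'BEGIN { printf "%.4f\n", c / t }')
echo "Fraction of cars with no accidents: $fraction"
```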




In [58]:
print("Fraction of cars with no accidents:",2223/2841)

Fraction of cars with no accidents: 0.7824709609292503


#### 5. Make the following replacements in `Housing.csv`, output the result into a new CSV:

* yes -> 1
* no -> 0
* unfurnished -> 0
* furnished -> 1
* semi-furnished -> 2
    
Hints:
* Use `sed` to do the replacement.
* Use pipes to chain multiple `sed` commands.
* To avoid replacing "unfurnished" or "semi-furnished" when performing the "furnished" replacement, try replacing ",furnished" with ",1".
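
A sketch of the chained replacements on a three-row stand-in file (the file names and rows are hypothetical). As a defensive choice that goes slightly beyond the hint, the `yes` and `no` patterns are also prefixed with a comma so that they can only match whole field values, not letters inside a column name.

```shell
cd "$(mktemp -d)"
cat > housing_sample.csv <<'EOF'
price,mainroad,guestroom,furnishingstatus
13300000,yes,no,furnished
12250000,yes,yes,semi-furnished
10150000,no,no,unfurnished
EOF

# Handle the longer furnishing labels first, then the shorter ones.
sed 's/semi-furnished/2/g' housing_sample.csv \
  | sed 's/unfurnished/0/g' \
  | sed 's/,furnished/,1/g' \
  | sed 's/,yes/,1/g' \
  | sed 's/,no/,0/g' > housing_encoded_sample.csv

cat housing_encoded_sample.csv
```

The same result can be had from a single `sed` call with several `-e` expressions; the pipe form is kept here because it mirrors the hints.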




In [88]:
    #^^Help figuring out sed syntax from LLM

In [65]:
!cd HW1_Datasets && head Housing_encoded.csv

price,area,bedrooms,bathrooms,stories,mainroad,guestroom,basement,hotwaterheating,airconditioning,parking,prefarea,furnishingstatus
13300000,7420,4,2,3,1,0,0,0,1,2,1,1
12250000,8960,4,4,4,1,0,0,0,1,3,0,1
12250000,9960,3,2,2,1,0,1,0,0,2,1,2
12215000,7500,4,2,2,1,0,1,0,1,3,1,1
11410000,7420,4,1,2,1,1,1,0,1,2,0,1
10850000,7500,3,3,1,1,0,1,0,1,2,1,2
10150000,8580,4,3,4,1,0,0,0,1,2,1,2
10150000,16200,5,3,2,1,0,0,0,0,0,0,0
9870000,8100,4,1,2,1,1,1,0,1,2,1,1


#### 6. Create a new CSV file from `Mall_Customers`, removing the "CustomerID" column.

Hints:
* Use `cut` command
* Default separator for `cut` is the tab character. For CSV, you have to use option `-d ','`.
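
For illustration, dropping the first column with `cut` might look like this. The sample file is a made-up stand-in, and the sketch assumes "CustomerID" is the first column.

```shell
cd "$(mktemp -d)"
cat > mall_sample.csv <<'EOF'
CustomerID,Gender,Age,Annual Income (k$),Spending Score (1-100)
1,Male,19,15,39
2,Male,21,15,81
3,Female,20,16,6
EOF

# -d sets the delimiter; -f 2- keeps field 2 through the last field.
cut -d ',' -f 2- mall_sample.csv > mall_sample_noID.csv
cat mall_sample_noID.csv
```

Note that `cut` splits on every comma, so this approach only works when no field contains a quoted comma.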




In [100]:
!cd HW1_Datasets && head Mall_Customers_noID.csv

Gender,Age,Annual Income (k$),Spending Score (1-100)
Male,19,15,39
Male,21,15,81
Female,20,16,6
Female,23,16,77
Female,31,17,40
Female,22,17,76
Female,35,18,6
Female,23,18,94
Male,64,19,3


#### 7. Create a new file that contains the sum of the following fields for each row:

* Research Quality Score
* Industry Score
* International Outlook
* Research Environment Score
    
Hints:
* Use `cut` to select the correct columns.
* Use `tr` to replace ',' with '+'.
* Pipe output into `bc` to compute the sum.
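
A rough sketch of the `cut` / `tr` / `bc` pipeline on invented numbers. The column positions (fields 2 to 5) and file names are assumptions, so check the real header with `head -n 1` to find the right field numbers. The header row is skipped because `bc` cannot evaluate column names, which is one common source of parse errors.

```shell
cd "$(mktemp -d)"
cat > rank_sample.csv <<'EOF'
name,Research Quality Score,Industry Score,International Outlook,Research Environment Score
Univ A,90.5,80.0,70.5,60.0
Univ B,10.1,20.2,30.3,40.4
EOF

# Skip the header, keep the four score columns, turn "a,b,c,d" into "a+b+c+d",
# and let bc evaluate one expression per line.
tail -n +2 rank_sample.csv \
  | cut -d ',' -f 2-5 \
  | tr ',' '+' \
  | bc > rank_sums.txt

cat rank_sums.txt
```

Rows with an empty score would produce expressions such as `90.5++70.5`, which `bc` also rejects, so the real file may need such rows filtered out first.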




In [105]:
    #^^Help with interpreting/fixing Parse error with bc from LLM

In [108]:
!cd HW1_Datasets && head university_sums.csv

378.2
367.2
340.5
361.2
353.3
363.0
335.9
354.9
329.1
355.7


#### 8. Sort the `cancer patient data sets.csv` file by age. Make sure the output is a readable CSV file.

Hints:
* Use `sort` with `-n`, `-t`, and `-k` options. 
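
A minimal sketch, assuming that age is the third column and using a made-up sample file. The header is written separately so that it stays on the first line instead of being sorted in with the data.

```shell
cd "$(mktemp -d)"
cat > patients_sample.csv <<'EOF'
index,Patient Id,Age,Gender
0,P1,33,1
1,P10,17,1
2,P100,35,2
3,P1000,14,1
EOF

# Header first, then the data rows sorted numerically on field 3 only.
head -n 1 patients_sample.csv > patients_sorted.csv
tail -n +2 patients_sample.csv | sort -t ',' -k 3,3 -n >> patients_sorted.csv

cat patients_sorted.csv
```

Writing `-k 3,3` rather than `-k 3` limits the sort key to the age field; with `-k 3` alone, the key would run from field 3 to the end of the line.
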

In [116]:
!cd HW1_Datasets && cut -d ',' -f 3 cancer_patient_dataset_sorted.csv | head -n 20

Age
14
14
14
14
14
14
14
14
14
17
17
17
17
17
17
17
17
17
17
