<a href="https://colab.research.google.com/github/python1999e/cioalba/blob/main/text_parsing_Unix.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Unix Command Practice for Text Parsing and Formatting in Colab

In this notebook, we will practice using Unix commands to parse, manipulate, and format plain text files directly within Google Colab.
We will work with a pathogen dataset that includes the Genus, Species, and Disease (in parentheses).
You will learn to use powerful Unix commands like `sort`, `cut`, `uniq`, `grep`, and `sed` to process the data and generate meaningful reports.


In [1]:
# Step 1: Download the file and view its contents to understand its structure
!wget https://raw.githubusercontent.com/PlantHealth-Analytics/learning_unix_in_colab/main/pathogens.txt
!cat pathogens.txt


--2026-01-14 03:24:38--  https://raw.githubusercontent.com/PlantHealth-Analytics/learning_unix_in_colab/main/pathogens.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4768 (4.7K) [text/plain]
Saving to: ‘pathogens.txt’


2026-01-14 03:24:39 (52.1 MB/s) - ‘pathogens.txt’ saved [4768/4768]

Phytophthora infestans (Late blight of potato)
Puccinia graminis (Wheat stem rust)
Fusarium oxysporum (Fusarium wilt)
Botrytis cinerea (Gray mold)
Magnaporthe oryzae (Rice blast)
Xanthomonas oryzae (Bacterial leaf blight of rice)
Erwinia amylovora (Fire blight)
Pseudomonas syringae (Bacterial speck of tomato)
Rhizoctonia solani (Root rot)
Sclerotinia sclerotiorum (White mold)
Alternaria solani (Early blight of tomato)
Cercospora beticola (Cercospora leaf spot)
Verticill
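
Printing a whole file with `cat` becomes unwieldy as files grow. The following is a minimal sketch of a lighter inspection, assuming a small stand-in file (`sample_pathogens.txt` is a hypothetical name, filled with four lines from the output above). In Colab it can be run in a `%%bash` cell.

```shell
# Build a small stand-in file (hypothetical name) from lines of the dataset
printf '%s\n' \
  'Phytophthora infestans (Late blight of potato)' \
  'Puccinia graminis (Wheat stem rust)' \
  'Fusarium oxysporum (Fusarium wilt)' \
  'Botrytis cinerea (Gray mold)' > sample_pathogens.txt

# Count the lines without printing them
wc -l < sample_pathogens.txt

# Show only the first two lines
head -n 2 sample_pathogens.txt
```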

In [2]:
# Step 2: Sort the file alphabetically; each line begins with the pathogen name
!sort pathogens.txt > sorted_pathogens.txt
#print("Sorted text:")
!cat sorted_pathogens.txt

Agrobacterium tumefaciens (Crown gall)
Albugo candida (White rust of crucifers)
Alternaria alternata (Alternaria leaf spot)
Alternaria brassicicola (Dark leaf spot)
Alternaria mali (Alternaria blotch)
Alternaria solani (Early blight of tomato)
Aphanomyces euteiches (Root rot of legumes)
Aspergillus flavus (Aflatoxin contamination)
Aureobasidium pullulans (Sooty mold)
Banana bunchy top virus (BBTV)
Barley yellow dwarf virus (BYDV)
Beet necrotic yellow vein virus (BNYVV)
Blumeria graminis (Powdery mildew)
Botrytis cinerea (Gray mold)
Brome mosaic virus (BMV)
Bursaphelenchus xylophilus (Pine wilt nematode)
Ceratocystis fagacearum (Oak wilt)
Ceratocystis fimbriata (Ceratocystis wilt)
Cercospora arachidicola (Early leaf spot of peanut)
Cercospora beticola (Cercospora leaf spot)
Citrus tristeza virus (CTV)
Cladosporium fulvum (Tomato leaf mold)
Clavibacter michiganensis (Bacterial canker)
Claviceps purpurea (Ergot)
Colletotrichum orbiculare (Anthracnose of cucurbits)
Colletotrichum spp. (Ant
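
`sort` can also order lines by a field other than the first. As a sketch, assuming a four-line stand-in file (`sort_demo.txt` is a hypothetical name) and byte-order comparison via `LC_ALL=C`, the following sorts by species, the second space-separated field. It can be run in a `%%bash` cell in Colab.

```shell
# Stand-in data (hypothetical file name) taken from the dataset
printf '%s\n' \
  'Phytophthora infestans (Late blight of potato)' \
  'Puccinia graminis (Wheat stem rust)' \
  'Fusarium oxysporum (Fusarium wilt)' \
  'Botrytis cinerea (Gray mold)' > sort_demo.txt

# -t ' ' sets the field separator; -k 2,2 restricts the sort key to field 2
LC_ALL=C sort -t ' ' -k 2,2 sort_demo.txt > sorted_by_species.txt
cat sorted_by_species.txt
```

With this input, the species order is cinerea, graminis, infestans, oxysporum.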

In [3]:
# Step 3: Use `cut` to check which separator divides the fields
# Check whether the separator is a tab or a space; cut splits on tabs by default.
# The pipe symbol "|" passes the output of one command to the next one (here, head),
# which displays only the first 5 lines to save space.
!cut -f 1 sorted_pathogens.txt | head -n 5

Agrobacterium tumefaciens (Crown gall)
Albugo candida (White rust of crucifers)
Alternaria alternata (Alternaria leaf spot)
Alternaria brassicicola (Dark leaf spot)
Alternaria mali (Alternaria blotch)


Note that the fields were not separated: `cut` splits on tabs by default, this file contains no tabs, and `cut` prints any line without the delimiter unchanged.
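
One way to make a missing delimiter visible is the `-s` option of `cut`, which suppresses lines that do not contain the delimiter. A small sketch, assuming a stand-in file (`delimiter_demo.txt` is a hypothetical name) with two space-separated lines and one tab-separated line, runnable in a `%%bash` cell:

```shell
# Two lines without tabs and one line with a tab (stand-in data)
printf 'Botrytis cinerea (Gray mold)\n' > delimiter_demo.txt
printf 'Erwinia amylovora (Fire blight)\n' >> delimiter_demo.txt
printf 'Blumeria graminis\tPowdery mildew\n' >> delimiter_demo.txt

# Default behavior: lines without a tab are printed whole
cut -f 1 delimiter_demo.txt

# With -s, only the line that contains a tab is printed
cut -s -f 1 delimiter_demo.txt
```

If `cut -s -f 1` prints nothing for a file, the file contains no tabs.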

In [4]:
# Check whether a space is the separator
!cut -d ' ' -f 1 sorted_pathogens.txt | head -n 5

Agrobacterium
Albugo
Alternaria
Alternaria
Alternaria


Note that you can now obtain the first column (field), which shows that the separator is a space.
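
Once the delimiter is known, `cut` can select several fields or an open-ended range. A brief sketch, assuming a two-line stand-in file (`fields_demo.txt` is a hypothetical name), runnable in a `%%bash` cell:

```shell
# Stand-in data (hypothetical file name) taken from the dataset
printf '%s\n' \
  'Alternaria solani (Early blight of tomato)' \
  'Botrytis cinerea (Gray mold)' > fields_demo.txt

# Fields 1 and 2 (genus and species), joined by the same delimiter
cut -d ' ' -f 1,2 fields_demo.txt

# Field 3 to the end of the line (the disease, still in parentheses)
cut -d ' ' -f 3- fields_demo.txt
```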

In [5]:
# Step 4: Use sed to replace the opening parenthesis "(" with a tab so we can divide the file into two
# fields (columns): pathogen and disease.
# Then delete the closing parenthesis ")" with sed
# and format the file with headers,
# using echo to add the headers and '>' to save the output.


# Replace "(" with tabs
# The sed syntax is: sed 's/text_to_replace/new_text/g'. Special characters such as the tab separator are written as escapes: \t
!sed 's/(/\t/g' sorted_pathogens.txt | sed 's/)//g' | cut -f 1,2 > tab_separated_pathogens.txt

#Add headers using echo
!echo -e "Pathogen\tDisease" > formatted_report_with_headers.txt

#Append the formatted data to the file
!cat tab_separated_pathogens.txt >> formatted_report_with_headers.txt

#View the formatted report
!cat formatted_report_with_headers.txt

Pathogen	Disease
Agrobacterium tumefaciens 	Crown gall
Albugo candida 	White rust of crucifers
Alternaria alternata 	Alternaria leaf spot
Alternaria brassicicola 	Dark leaf spot
Alternaria mali 	Alternaria blotch
Alternaria solani 	Early blight of tomato
Aphanomyces euteiches 	Root rot of legumes
Aspergillus flavus 	Aflatoxin contamination
Aureobasidium pullulans 	Sooty mold
Banana bunchy top virus 	BBTV
Barley yellow dwarf virus 	BYDV
Beet necrotic yellow vein virus 	BNYVV
Blumeria graminis 	Powdery mildew
Botrytis cinerea 	Gray mold
Brome mosaic virus 	BMV
Bursaphelenchus xylophilus 	Pine wilt nematode
Ceratocystis fagacearum 	Oak wilt
Ceratocystis fimbriata 	Ceratocystis wilt
Cercospora arachidicola 	Early leaf spot of peanut
Cercospora beticola 	Cercospora leaf spot
Citrus tristeza virus 	CTV
Cladosporium fulvum 	Tomato leaf mold
Clavibacter michiganensis 	Bacterial canker
Claviceps purpurea 	Ergot
Colletotrichum orbiculare 	Anthracnose of cucurbits
Colletotrichum spp. 	Anthracnose
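
In the report above, each pathogen name keeps a trailing space before the tab, because only the `(` was replaced. One possible variant, sketched here on a two-line stand-in file (`sed_demo.txt` and `sed_demo.tsv` are hypothetical names), replaces the space together with the parenthesis. It stores a literal tab in a shell variable, since not every `sed` interprets `\t`. It can be run in a `%%bash` cell.

```shell
# Stand-in data (hypothetical file name) taken from the dataset
printf '%s\n' \
  'Agrobacterium tumefaciens (Crown gall)' \
  'Albugo candida (White rust of crucifers)' > sed_demo.txt

# Hold a literal tab so the command does not depend on sed expanding \t
tab=$(printf '\t')

# Replace " (" with a tab, then drop the ")" at the end of the line
sed -e "s/ (/${tab}/" -e 's/)$//' sed_demo.txt > sed_demo.tsv
cat sed_demo.tsv
```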

In [6]:
# Step 5: Filter diseases related to a specific keyword using `grep` (e.g., "mildew")
!grep "mildew" formatted_report_with_headers.txt


Blumeria graminis 	Powdery mildew
Erysiphe necator 	Powdery mildew of grape
Leveillula taurica 	Powdery mildew of pepper
Peronospora destructor 	Downy mildew of onion
Peronospora parasitica 	Downy mildew
Pseudoperonospora cubensis 	Downy mildew of cucurbits
Sphaerotheca fuliginea 	Powdery mildew of cucurbits
Sphaerotheca pannosa 	Powdery mildew of rose
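
The `grep` output above drops the header row, which matters if the filtered lines are saved as a report. A possible alternative, sketched with `awk` on a small stand-in file (`report_demo.txt` and `mildew_report.txt` are hypothetical names), keeps line 1 and every matching line. It can be run in a `%%bash` cell.

```shell
# Stand-in report with a header row (hypothetical file name)
printf 'Pathogen\tDisease\n' > report_demo.txt
printf 'Blumeria graminis\tPowdery mildew\n' >> report_demo.txt
printf 'Botrytis cinerea\tGray mold\n' >> report_demo.txt
printf 'Peronospora parasitica\tDowny mildew\n' >> report_demo.txt

# NR == 1 selects the header; /mildew/ selects the matching lines
awk 'NR == 1 || /mildew/' report_demo.txt > mildew_report.txt
cat mildew_report.txt
```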


## **Exercise 1: Count the Number of Pathogens for Each Disease Type**

In [7]:

# Task: Count the number of pathogens associated with a specific disease (e.g., "wilt")

# Step 1: Use grep to find pathogens related to "wilt"
!grep "wilt" formatted_report_with_headers.txt

# Step 2: Count the number of lines that contain "wilt"
!grep -c "wilt" formatted_report_with_headers.txt


Bursaphelenchus xylophilus 	Pine wilt nematode
Ceratocystis fagacearum 	Oak wilt
Ceratocystis fimbriata 	Ceratocystis wilt
Fusarium oxysporum 	Fusarium wilt
Ralstonia solanacearum 	Bacterial wilt
Tomato spotted wilt virus 	TSWV
Verticillium albo-atrum 	Verticillium wilt
Verticillium dahliae 	Verticillium wilt
8
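
The count of 8 includes `Tomato spotted wilt virus`, where "wilt" appears in the pathogen name and not in the disease. To count only diseases, the match can be restricted to the second tab-separated column. A sketch on a small stand-in file (`wilt_demo.txt` is a hypothetical name), runnable in a `%%bash` cell:

```shell
# Stand-in report (hypothetical file name) with lines from the dataset
printf 'Pathogen\tDisease\n' > wilt_demo.txt
printf 'Fusarium oxysporum\tFusarium wilt\n' >> wilt_demo.txt
printf 'Tomato spotted wilt virus\tTSWV\n' >> wilt_demo.txt
printf 'Ceratocystis fagacearum\tOak wilt\n' >> wilt_demo.txt
printf 'Botrytis cinerea\tGray mold\n' >> wilt_demo.txt

# Whole-line match: also counts "wilt" inside the pathogen name
grep -c 'wilt' wilt_demo.txt

# Column match: split on tabs and test only field 2 (Disease)
awk -F '\t' '$2 ~ /wilt/ { n++ } END { print n + 0 }' wilt_demo.txt
```

On this stand-in data the first command reports 3 and the second reports 2.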


## **Exercise 2: Find and Extract Specific Genus Using `grep` and `awk`**

In [8]:

# Task: Extract all entries related to a specific genus (e.g., "Xanthomonas")

# Step 1: Find all pathogens related to "Xanthomonas"
!echo "result with grep"
!grep "^Xanthomonas" formatted_report_with_headers.txt
!echo "..............................................."
!echo "result with awk"
!awk '$1 == "Xanthomonas"' formatted_report_with_headers.txt

result with grep
Xanthomonas axonopodis 	Citrus canker
Xanthomonas campestris 	Black rot of crucifers
Xanthomonas oryzae 	Bacterial leaf blight of rice
Xanthomonas vesicatoria 	Bacterial spot of pepper
...............................................
result with awk
Xanthomonas axonopodis 	Citrus canker
Xanthomonas campestris 	Black rot of crucifers
Xanthomonas oryzae 	Bacterial leaf blight of rice
Xanthomonas vesicatoria 	Bacterial spot of pepper
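
Both commands give the same result here, but they test different things: `grep "^Xanthomonas"` matches any line that starts with that text, while `awk` compares the whole first field. The difference shows up when one genus name is contained in another. A sketch using three lines from the mildew results (`genus_demo.txt` is a hypothetical file name), runnable in a `%%bash` cell:

```shell
# Stand-in data (hypothetical file name) taken from the dataset
printf 'Peronospora destructor\tDowny mildew of onion\n' > genus_demo.txt
printf 'Peronospora parasitica\tDowny mildew\n' >> genus_demo.txt
printf 'Pseudoperonospora cubensis\tDowny mildew of cucurbits\n' >> genus_demo.txt

# Unanchored, case-insensitive search: also matches Pseudoperonospora
grep -ci 'peronospora' genus_demo.txt

# Anchored search: only lines that start with the genus name
grep -c '^Peronospora' genus_demo.txt

# Exact comparison of the first whitespace-separated field
awk '$1 == "Peronospora"' genus_demo.txt
```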


## **Exercise 3: Reformat the Sorted Pathogen List into CSV Format**
We will reformat the data so that the pathogen name and the disease are separated by a comma, using `sed`. CSV format is commonly used in data science applications.

In [9]:

# Task: Reformat the sorted pathogen list into CSV format

# Step 1: Use sed to replace " (" with a comma and delete ")", giving one pathogen,disease row per line

!sed 's/ (/,/g' sorted_pathogens.txt | sed 's/)//g' > pathogens.csv

# Step 2: View the CSV-formatted data

!cat pathogens.csv




Agrobacterium tumefaciens,Crown gall
Albugo candida,White rust of crucifers
Alternaria alternata,Alternaria leaf spot
Alternaria brassicicola,Dark leaf spot
Alternaria mali,Alternaria blotch
Alternaria solani,Early blight of tomato
Aphanomyces euteiches,Root rot of legumes
Aspergillus flavus,Aflatoxin contamination
Aureobasidium pullulans,Sooty mold
Banana bunchy top virus,BBTV
Barley yellow dwarf virus,BYDV
Beet necrotic yellow vein virus,BNYVV
Blumeria graminis,Powdery mildew
Botrytis cinerea,Gray mold
Brome mosaic virus,BMV
Bursaphelenchus xylophilus,Pine wilt nematode
Ceratocystis fagacearum,Oak wilt
Ceratocystis fimbriata,Ceratocystis wilt
Cercospora arachidicola,Early leaf spot of peanut
Cercospora beticola,Cercospora leaf spot
Citrus tristeza virus,CTV
Cladosporium fulvum,Tomato leaf mold
Clavibacter michiganensis,Bacterial canker
Claviceps purpurea,Ergot
Colletotrichum orbiculare,Anthracnose of cucurbits
Colletotrichum spp.,Anthracnose
Corynespora cassiicola,Target spot
Cucumbe
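
The CSV above has two columns, pathogen name and disease. Splitting the name into Genus and Species as well could be sketched as follows, assuming two-word binomial names (`csv_demo.txt` and `pathogens_three_columns.csv` are hypothetical file names). It can be run in a `%%bash` cell.

```shell
# Stand-in data (hypothetical file name) taken from the dataset
printf '%s\n' \
  'Alternaria solani (Early blight of tomato)' \
  'Botrytis cinerea (Gray mold)' > csv_demo.txt

# Write a header row, then capture genus, species, and disease in three groups
echo 'Genus,Species,Disease' > pathogens_three_columns.csv
sed -E 's/^([^ ]+) ([^(]+) \((.*)\)$/\1,\2,\3/' csv_demo.txt >> pathogens_three_columns.csv
cat pathogens_three_columns.csv
```

For multi-word virus names such as `Banana bunchy top virus`, this pattern would place the first word in Genus and the remaining words in Species, so those rows would need separate handling.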

## **Exercise 4: Extract Unique Genera Using `cut`, `sort`, and `uniq`**

In [10]:

# Task: Extract and list all unique genera from the dataset

# Step 1: Use cut to extract the first field (Genus) from the file
!cut -d ' ' -f 1 sorted_pathogens.txt > genera.txt

# Step 2: Sort and remove duplicates to get the unique genera

!sort genera.txt | uniq > unique_genera.txt
# Count the unique genera
!cat unique_genera.txt | wc -l

# Step 3: Count how many entries each genus has

!sort genera.txt | uniq -c

# All in one: use "|" to chain the commands on a single line.
# uniq only removes adjacent duplicates; it works here without sort because sorted_pathogens.txt is already sorted. For example:
!cut -d ' ' -f 1 sorted_pathogens.txt | uniq | wc -l

102
      1 Agrobacterium
      1 Albugo
      4 Alternaria
      1 Aphanomyces
      1 Aspergillus
      1 Aureobasidium
      1 Banana
      1 Barley
      1 Beet
      1 Blumeria
      1 Botrytis
      1 Brome
      1 Bursaphelenchus
      2 Ceratocystis
      2 Cercospora
      1 Citrus
      1 Cladosporium
      1 Clavibacter
      1 Claviceps
      2 Colletotrichum
      1 Corynespora
      1 Cucumber
      1 Diaporthe
      1 Didymella
      1 Diplocarpon
      1 Ditylenchus
      1 Elsinoë
      1 Erwinia
      1 Erysiphe
      1 Eutypa
      1 Exserohilum
      2 Fusarium
      1 Fusicladium
      1 Gaeumannomyces
      1 Gibberella
      1 Globodera
      1 Glomerella
      1 Grapevine
      1 Guignardia
      1 Hemileia
      1 Heterodera
      1 Lasiodiplodia
      1 Leptosphaeria
      1 Leveillula
      1 Macrophomina
      1 Magnaporthe
      1 Maize
      1 Marssonina
      1 Meloidogyne
      1 Monilinia
      1 Moniliophthora
      1 Mycosphaerella
      1 Myrothecium