# Modifying lines

Sometimes it is not enough to extract the lines that contain the information you are interested in. In many cases, you want to extract one very specific piece of information from each line and nothing else.

Take a look at the file in this directory. It contains the domain annotation of several proteins in [XML](https://en.wikipedia.org/wiki/XML) format. Try to extract all lines that contain the "feature type" annotation of the different domains using `grep`. 

In this exercise we will learn how to modify these lines so that we can extract only the name of the feature type.

## The sed command - a BASH powerhouse

You have already learned about one command with which you can modify the content of a line. Using the `cut` command and different delimiters, you can already extract quite a lot of information from most files.
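As a quick reminder of how `cut` selects fields, here is a minimal sketch. The records in it are made up for illustration and are not taken from the exercise file.

```shell
# Hypothetical tab-separated record (not from the exercise file)
record=$(printf 'P12345\tkinase\t42')

# -f2 selects the second field; tab is the default delimiter of cut
second=$(printf '%s\n' "$record" | cut -f2)
echo "$second"

# -d sets a different delimiter, here a comma
third=$(printf 'a,b,c\n' | cut -d',' -f3)
echo "$third"
```

`cut` works well as long as the piece you want always sits in the same field.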

Although the `cut` command is powerful in its own right, the `sed` command gives you the full power of regular expressions for your data extraction.

Not only that: at its core, the `sed` command is used to **replace** one pattern with another. Look below for an example of the `sed` syntax:

In [2]:
%%bash
# you already know this one
cat ejemplo.txt

XXXXXXXXXXXXX
aaaaa	xxxxx
xxxxx	bbbbb
ccccc	xxxxx
xxxxx	ddddd
eeeee	xxxxx
aaaaa	bbbbb
....	fffff
axaxa	bxbxb
XXXXXXXXXXXXX

In [4]:
%%bash
sed 's/x/y/' ejemplo.txt
# the s at the beginning of the expression stands for "substitute"

XXXXXXXXXXXXX
aaaaa	yxxxx
yxxxx	bbbbb
ccccc	yxxxx
yxxxx	ddddd
eeeee	yxxxx
aaaaa	bbbbb
....	fffff
ayaxa	bxbxb
XXXXXXXXXXXXX

In [5]:
%%bash
# replace ALL occurrences of the pattern, not only the first one (the g stands for global)
sed 's/x/y/g' ejemplo.txt

XXXXXXXXXXXXX
aaaaa	yyyyy
yyyyy	bbbbb
ccccc	yyyyy
yyyyy	ddddd
eeeee	yyyyy
aaaaa	bbbbb
....	fffff
ayaya	bybyb
XXXXXXXXXXXXX

In [8]:
%%bash
# sed can use regular expressions and special characters (\t for a tab is understood by GNU sed)
sed 's/.\tb/@/' ejemplo.txt

XXXXXXXXXXXXX
aaaaa	xxxxx
xxxx@bbbb
ccccc	xxxxx
xxxxx	ddddd
eeeee	xxxxx
aaaa@bbbb
....	fffff
axax@xbxb
XXXXXXXXXXXXX

In [9]:
%%bash
# you can use sed to remove parts of the line: just substitute them with nothing
sed 's/^..//' ejemplo.txt

XXXXXXXXXXX
aaa	xxxxx
xxx	bbbbb
ccc	xxxxx
xxx	ddddd
eee	xxxxx
aaa	bbbbb
..	fffff
axa	bxbxb
XXXXXXXXXXX

In [12]:
%%bash
# sed works with other delimiters as well, not only slashes
sed 's|X|banana|g' ejemplo.txt

bananabananabananabananabananabananabananabananabananabananabananabananabanana
aaaaa	xxxxx
xxxxx	bbbbb
ccccc	xxxxx
xxxxx	ddddd
eeeee	xxxxx
aaaaa	bbbbb
....	fffff
axaxa	bxbxb
bananabananabananabananabananabananabananabananabananabananabananabananabanana
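
A different delimiter is most useful when the pattern itself contains slashes. The following sketch uses a made-up path (an assumption for illustration only) and performs the same substitution twice, once with `/` and once with `|` as the delimiter.

```shell
# Hypothetical path, used only for illustration
path='/home/user/data/proteins.xml'

# With slashes as delimiters, every slash in the pattern must be escaped
escaped=$(printf '%s\n' "$path" | sed 's/\/home\/user/\/tmp/')

# With | as the delimiter, the same substitution needs no escaping
readable=$(printf '%s\n' "$path" | sed 's|/home/user|/tmp|')

echo "$escaped"
echo "$readable"
```

Both commands should print the same modified path. The second form is simply easier to read.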

## How to use sed for data extraction

Let's go back to data extraction. Suppose you want to extract only the feature type name from a line like this:

`                <feature type="Gmad2" instance="3" clan="Gmad2" evalue="1.8e-07">`

Just follow these steps to extract the information you need:
1. Identify your piece of information in the line (it would be Gmad2 in this case)
2. Look at all the information left of it and think of a regular expression that describes its pattern
3. Remove the pattern with sed
4. Look to the right of the information, think of a pattern that describes it, and remove that as well

Think of `sed` as a pair of clippers: you snip away everything that you are not interested in, piece by piece.
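
The steps above can be sketched on the single example line as follows. The variable names are made up for illustration, and the two patterns are only one possible choice. Other regular expressions would work just as well.

```shell
# The example line from above, stored in a variable for illustration
line='                <feature type="Gmad2" instance="3" clan="Gmad2" evalue="1.8e-07">'

# Steps 2 and 3: remove everything up to and including the opening quote of the type value
left_trimmed=$(printf '%s\n' "$line" | sed 's/^.*feature type="//')
echo "$left_trimmed"

# Step 4: remove everything from the next quote to the end of the line
feature=$(printf '%s\n' "$left_trimmed" | sed 's/".*$//')
echo "$feature"
```

After the first snip the line starts with the feature type name, and after the second snip only the name is left. When working with a whole file, the same two `sed` commands can be placed after a `grep` in a pipeline.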

## Tasks

1. How many different feature types can be found, and which ones are they?

In [None]:
%%bash


2. Are any of these features repeated in 10 or more instances in any of the proteins? If so, which features are these?

In [None]:
%%bash
