The binding site for nilotinib on the tyrosine kinase protein ABL1 is primarily located within the ATP-binding pocket of the protein. 
Nilotinib is a type of tyrosine kinase inhibitor that specifically targets the BCR-ABL fusion protein, which is commonly found in 
certain types of leukemia, such as chronic myelogenous leukemia (CML). By binding to the ATP-binding pocket of the ABL1 kinase domain, nilotinib inhibits the tyrosine kinase activity of the protein, thereby interfering with the signaling pathways that promote cancer cell growth and proliferation.

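The ATP-binding pocket of the tyrosine kinase protein ABL1 contains a conserved glycine-rich sequence motif (the glycine-rich loop, or P-loop), which lies close to the "hinge region" linking the two lobes of the kinase domain.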
The amino acid sequence within the ATP-binding pocket of the ABL1 kinase domain includes residues that are critical for ATP
binding and kinase activity. The specific amino acid sequence varies depending on the region within the ATP-binding pocket. 
Here is a general representation of the amino acid sequence motif within the ATP-binding pocket of tyrosine kinase proteins like ABL1:

GXGXXG (N/C)..... (D/E)

In this motif:
- "G" represents glycine
- "X" represents any amino acid
- "(N/C)" indicates either asparagine (N) or cysteine (C)
- "(D/E)" indicates either aspartic acid (D) or glutamic acid (E)

This motif is characteristic of the ATP-binding pockets of many tyrosine kinases, including ABL1. 
The specific residues and their positions within the ABL1 ATP-binding pocket would need to be 
confirmed through structural analysis or experimental studies.

Create a regex that will find this binding domain.

In [1]:
import re

ABL1 = """>pdb|3QRK|A Chain A, Tyrosine-protein kinase ABL1
TSMDPSSPNYDKWEMERTDITMKHKLGGGQYGEVYEGVWKKYSLTVAVKTLKEDTMEVEEFLKEAAVMKE
IKHPNLVQLLGVCTREPPFYIITEFMTYGNLLDYLRECNRQEVNAVVLLYMATQISSAMEYLEKKNFIHR
DLAARNCLVGENHLVKVADFGLSRLMTGDTYTAHAGAKFPIKWTAPESLAYNKFSIKSDVWAFGVLLWEI
ATYGMSPYPGIDLSQVYELLEKDYRMERPEGCPEKVYELMRACWQWNPSDRPSFAEIHQAFETMFQE

>pdb|3QRJ|B Chain B, Tyrosine-protein kinase ABL1
TSMDPSSPNYDKWEMERTDITMKHKLGGGQYGEVYEGVWKKYSLTVAVKTLKEDTMEVEEFLKEAAVMKE
IKHPNLVQLLGVCTREPPFYIIIEFMTYGNLLDYLRECNRQEVNAVVLLYMATQISSAMEYLEKKNFIHR
DLAARNCLVGENHLVKVADFGLSRLMTGDTYTAHAGAKFPIKWTAPESLAYNKFSIKSDVWAFGVLLWEI
ATYGMSPYPGIDLSQVYELLEKDYRMERPEGCPEKVYELMRACWQWNPSDRPSFAEIHQAFETMFQE

>pdb|3QRJ|A Chain A, Tyrosine-protein kinase ABL1
TSMDPSSPNYDKWEMERTDITMKHKLGGGQYGEVYEGVWKKYSLTVAVKTLKEDTMEVEEFLKEAAVMKE
IKHPNLVQLLGVCTREPPFYIIIEFMTYGNLLDYLRECNRQEVNAVVLLYMATQISSAMEYLEKKNFIHR
DLAARNCLVGENHLVKVADFGLSRLMTGDTYTAHAGAKFPIKWTAPESLAYNKFSIKSDVWAFGVLLWEI
ATYGMSPYPGIDLSQVYELLEKDYRMERPEGCPEKVYELMRACWQWNPSDRPSFAEIHQAFETMFQE

>pdb|3QRI|B Chain B, Tyrosine-protein kinase ABL1
TSMDPSSPNYDKWEMERTDITMKHKLGGGQYGEVYEGVWKKYSLTVAVKTLKEDTMEVEEFLKEAAVMKE
IKHPNLVQLLGVCTREPPFYIITEFMTYGNLLDYLRECNRQEVNAVVLLYMATQISSAMEYLEKKNFIHR
DLAARNCLVGENHLVKVADFGLSRLMTGDTYTAHAGAKFPIKWTAPESLAYNKFSIKSDVWAFGVLLWEI
ATYGMSPYPGIDLSQVYELLEKDYRMERPEGCPEKVYELMRACWQWNPSDRPSFAEIHQAFETMFQE

>pdb|3QRI|A Chain A, Tyrosine-protein kinase ABL1
TSMDPSSPNYDKWEMERTDITMKHKLGGGQYGEVYEGVWKKYSLTVAVKTLKEDTMEVEEFLKEAAVMKE
IKHPNLVQLLGVCTREPPFYIITEFMTYGNLLDYLRECNRQEVNAVVLLYMATQISSAMEYLEKKNFIHR
DLAARNCLVGENHLVKVADFGLSRLMTGDTYTAHAGAKFPIKWTAPESLAYNKFSIKSDVWAFGVLLWEI
ATYGMSPYPGIDLSQVYELLEKDYRMERPEGCPEKVYELMRACWQWNPSDRPSFAEIHQAFETMFQE

>pdb|3EGU|A Chain A, Proto-oncogene tyrosine-protein kinase ABL1
MENDPNLFVALYDFVASGDNTLSITKGEKLRVLGYNHNGEWCEAQTKNGQGWVPSAYITPVNS

>pdb|3CS9|D Chain D, Proto-oncogene tyrosine-protein kinase ABL1
GAMDPSPNYDKWEMERTDITMKHKLGGGQYGEVYEGVWKKYSLTVAVKTLKEDTMEVEEFLKEAAVMKEI
KHPNLVQLLGVCTREPPFYIITEFMTYGNLLDYLRECNRQEVNAVVLLYMATQISSAMEYLEKKNFIHRD
LAARNCLVGENHLVKVADFGLSRLMTGDTYTAHAGAKFPIKWTAPESLAYNKFSIKSDVWAFGVLLWEIA
TYGMSPYPGIDLSQVYELLEKDYRMERPEGCPEKVYELMRACWQWNPSDRPSFAEIHQAFETMFQES

>pdb|3CS9|C Chain C, Proto-oncogene tyrosine-protein kinase ABL1
GAMDPSPNYDKWEMERTDITMKHKLGGGQYGEVYEGVWKKYSLTVAVKTLKEDTMEVEEFLKEAAVMKEI
KHPNLVQLLGVCTREPPFYIITEFMTYGNLLDYLRECNRQEVNAVVLLYMATQISSAMEYLEKKNFIHRD
LAARNCLVGENHLVKVADFGLSRLMTGDTYTAHAGAKFPIKWTAPESLAYNKFSIKSDVWAFGVLLWEIA
TYGMSPYPGIDLSQVYELLEKDYRMERPEGCPEKVYELMRACWQWNPSDRPSFAEIHQAFETMFQES

>pdb|3CS9|B Chain B, Proto-oncogene tyrosine-protein kinase ABL1
GAMDPSPNYDKWEMERTDITMKHKLGGGQYGEVYEGVWKKYSLTVAVKTLKEDTMEVEEFLKEAAVMKEI
KHPNLVQLLGVCTREPPFYIITEFMTYGNLLDYLRECNRQEVNAVVLLYMATQISSAMEYLEKKNFIHRD
LAARNCLVGENHLVKVADFGLSRLMTGDTYTAHAGAKFPIKWTAPESLAYNKFSIKSDVWAFGVLLWEIA
TYGMSPYPGIDLSQVYELLEKDYRMERPEGCPEKVYELMRACWQWNPSDRPSFAEIHQAFETMFQES

>pdb|3CS9|A Chain A, Proto-oncogene tyrosine-protein kinase ABL1
GAMDPSPNYDKWEMERTDITMKHKLGGGQYGEVYEGVWKKYSLTVAVKTLKEDTMEVEEFLKEAAVMKEI
KHPNLVQLLGVCTREPPFYIITEFMTYGNLLDYLRECNRQEVNAVVLLYMATQISSAMEYLEKKNFIHRD
LAARNCLVGENHLVKVADFGLSRLMTGDTYTAHAGAKFPIKWTAPESLAYNKFSIKSDVWAFGVLLWEIA
TYGMSPYPGIDLSQVYELLEKDYRMERPEGCPEKVYELMRACWQWNPSDRPSFAEIHQAFETMFQES
"""

TyrKin and ABL1 are defined from the two FASTA files attached to the homework.

In [2]:
TyrKin = """>WP_329174884.1 MULTISPECIES: diacylglycerol kinase family protein [unclassified Streptomyces]
MRALLVVNPAATTTSARVRDVISTALASDLKLEVATTEYRGHARDLARRAVEGGEVELVVSLGGDGTVNE
IVNGLLHHGPDPDRLPQLAVVPGGSTNVFARALGLPNDAVEATGALLDALRAGSSRTVGLGLVSGTPGTE
DEGVPDRWFTFCAGLGFDAGVVGRVEQQRELGKRSTHALYLRQVARQFVADQHRRHGTITLERPGEEPVA
GLMTAIVCNTAPWTYLGNRPVYASPDASFDTALDLLGVSKMSTLSGTRYAAQLLRSTPERGPRGKHTVAL
HDLTDFTLQSQAALPFQVDGDHLGLRTSASFTGVRRALRVIV

>WP_298819651.1 CpsD/CapB family tyrosine-protein kinase [Chloroflexus sp.]
MFGWLKNKTNTTVADPPLVIETADGALRRAFAGEQIGQLRRMLTDLLVGQRLPNRIGFTSALRNEGVTCI
TLASAVTLAHDTGKRVCVVELNWVAPGMLANLRPLATPEVKSKRKAAQPEEVAPPLPVLPGVADVLRGQA
TLDETLLATNYAGLRLLPAGVAALEQRPLLARGTELRNLLDHLNTHFDYVLLDLPAVLETSDTLALAALA
TAYAMVVRHGVTPVTEVRRALNDLQHVPVLGVILNQAHIATPRWIHRLIPQE

>WP_298815715.1 CpsD/CapB family tyrosine-protein kinase [Chloroflexus sp.]
MLVTSSPEQVLITLREPASAAAEAYRTLRTNILFSSLDKPIHTLLLTAAEPTPEKSLTAANLAVTMAQAE
QRVLLVDCDLRQPMLHTIFGLSNEHGLTSSILDQDAPLAIQPTEVAGLSLLPSGPLPPRPADLLGSRRME
HLLARLRQTADIVIFDTPPVQNVTDALVLATRVDGVLLVVQAGRSRRDRVREARQKLEKVKANVLGVVLS
GART

>WP_298003881.1 CpsD/CapB family tyrosine-protein kinase, partial [Anaerolinea sp.]
SPVTESFRSLRTNVRYASVDHPIRVLMVTSPTPEDGKTTVAINLAVVIAQSGNRCVLIDSDLRRPRIHKI
MGLDNRQGLSGLFIQNPLSLNGTIQQTRLETLKVIPSGGTPPTPSELLGSRKMKEIVEAVANDTDMIILD
TPPVLSVTDAAVLSPLVDGVLLVVKVGTTKQSALIQAVEQLQQVNARILGVVVNGISPKNSHYYYRNYAY
YQYYYYGDGGNVKKKRKSAVSEPSSQPASINPPSKPISRQ

>WP_288310007.1 diacylglycerol kinase family protein [Prevotella pectinovora]
MPQKKIAFIMNPISGTVKKAGIPKLIDEYLDKDTFSYSILNTEYAGHATLLARKACEDNCDVVVAVGGDG
TINEIGRALIDTNTAMGIIPCGSGNGLARHLMLPMDIKKCIMIINKCMTHDLDYGVIDGHPFFCTCGMGF
DAYISMRFAEAGKRGPVTYLENILKSGLKYEAEEYTIEAEEIGTIHKRAYLVSCANASQYGNNAYIAPQA
SMSDGLLDVIIMEPFDMIDAPQISIDMFNKTINKNPKVKTFKSSHILVHRKNEGVIHYDGDPVIAGKDID
ITLHHKGIKIVVNPEADKSRRKPNMVQNAFCDLFNDIYAVRSDITKQGKHIQTINKLLLKKLNK

>WP_286828957.1 MULTISPECIES: diacylglycerol kinase family protein [Kordiimonas]
MSERVAIIRNRFAGQGKSSLVDRIAEKLRRGGQHVRMLDTEYAGHATELATKIAATGEADVIVAAGGDGT
IREVAEGAYGHKVPVGIIPAGSANVLARELGYMKSGQVSARHVANALLSRDIVDLYPFEVERDGRVQLGL
CWVGVGFDAEVLRNVSPSLKEKIGRASFVPAIIQALVGDSSLPDITWHMHRNTKGTCGWALLSNIRRYAG
PFIVTKKTTYNSHGLACLMSQGHGAWPRILDQLAIGLRRFDKRAGVWVLEKATLHLGSDQTPVQMDGDFL
GFGEVAVTPKKHPLPFRAFVRR

>WP_283268727.1 MULTISPECIES: diacylglycerol kinase family protein [unclassified Acinetobacter]
MSQNLRPLSLIYNQKSGFHAAQQDEQYERLMTLWTQYGFEIQVFELNQQVNFDEMMTSVLSRHQQADLRG
VVVAAGGDGTLNAVAKKLMHTDIPMGIMPLGTFNYVARALNIPIDLGLAAEVIATGIEHKVHVATLNDQI
YLNNASLGLYPLFIKKRELYNQRFGRFPLNAYTSGLDVLLREHKSLKLAVTVDGQKYPVATPLVFFGNNQ
LQLSDMKLRIAECAANGKLAGVVVAKSDRLSLLKMLWTLIQGKIDQASDVYSFCAEQIQVDCKKKKLTVA
VDGELMEMQTPLKFAVEKSALKVMVPNAVASV

>WP_248460666.1 MULTISPECIES: diacylglycerol kinase family protein [Acetobacteraceae]
MQKLMNVVQSALIHNPNSRKNAQDKGQFVRMARQKMGDFCVSALNDSHLPAHLTELKSRGVDLIAISGGD
GTVSACLTAIASVYHDCPLPSVAILPSGNTNLIAGDVGFGLHGEAAIDRLLRPEGLKSCIRTPIRLSWPD
GERPSVLGMFGGCTGYARAVRIAHSPNVLKFAPHDLAVFFTLFSSMASLLFRRSRHSWMNGDRLSWSAEG
TGLPVPGREGRSFLFMTTALEKLSHGIWPFWNEDTSRDGFHFLDVHAFPEALPKACFNLLRGRAPEWLRA
HKDYTSATARAMQLETDSDFVLDGEVFPVSAGGRLKLEEGPAFRFVHV

>WP_202283271.1 MULTISPECIES: CpsD/CapB family tyrosine-protein kinase [Bacillus cereus group]
MFGRKKRKPLRQLITHKEPKSRITEQYRNIRTNIEFTSIDNHIRSIVVTSANPGEGKTTTISNLAVVFGQ
QGKKVLLIGADLRKPTIQNLFAVHSPNGLTNLLSGQAKFMQCIQKTDIENVYVMAAGPIPPNPAELLGSR
AMDELLLEAYNMFDIILIDTPPVLAVTDAQILANKCDGSVLVVRSEKTEKDKIVKTKQILDKASGKLLGV
VLNDKREEKGQYGYY

>WP_134995977.1 CpsD/CapB family tyrosine-protein kinase [Bacillus cereus]
MALNLFKKKKNHRQRRQLIAHQQPKSPISEQYRNIRTNIEFAAIDTNLHSLMVTSANPSEGKTTTTANMA
VVFAQQGKKVLLIDADMRKPAMHQMFQVDNIFGLTNVLTNSERLEKCVQTTSVDNLHFLACGPIPPNPAE
LLGSKSMKELLGQAYSIYDLVIFDMPPILAVTDAQIMANVCDASILVVRSESTEKETAVKAKGLLESAKG
KLLGVVLNDREREEGLYYYYGAN

"""

Below is the search for the original pattern (later noted to be incorrect in lecture on 3/26).

In [3]:
pattern = r'G.G.{2}G[NC].{0,5}[DE]'
# GXGXXG(N/C).....(D/E), from the original question
# G       - matches the character G
# .       - matches any character except a newline (unless re.DOTALL is set)
# G       - matches the character G
# .{2}    - matches any two characters
# G       - matches the character G
# [NC]    - matches either N or C
# .{0,5}  - matches any zero to five characters
# [DE]    - matches either D or E
# Note: with flags=re.DOTALL, "." also matches the line breaks inside the FASTA text.
match = re.search(pattern, TyrKin, flags=re.DOTALL)
if match:
    print('There is a match')
else:
    print('There are no matches')

match = re.search(pattern, ABL1, flags=re.DOTALL)
if match:
    print('There is a match')
else:
    print('There are no matches')

There are no matches
There are no matches
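
The search above runs on the raw FASTA text, so header lines and line breaks are part of the string being searched. A motif that happens to straddle a line break can therefore be missed, and with `re.DOTALL` the newline is counted as if it were a residue. A minimal sketch of one way around this is shown below, assuming standard FASTA layout; `parse_fasta` and the `demo` records are hypothetical names and data introduced only for illustration.

```python
import re

def parse_fasta(text):
    """Map each FASTA header to its sequence with line breaks removed."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank separator lines
        if line.startswith('>'):
            header = line[1:]
            records[header] = ''
        elif header is not None:
            records[header] += line
    return records

# Hypothetical records; the motif in seq1 is split across a line break.
demo = """>seq1 demo record
AAAGAGAAGN
KKKKKDAAA
>seq2 demo record
MMMMMMMM
"""

pattern = r'G.G.{2}G[NC].{0,5}[DE]'
for header, seq in parse_fasta(demo).items():
    hit = re.search(pattern, seq)
    print(header, '->', hit.group() if hit else 'no match')
```

In this demo the motif is found only after the lines are joined; searching the raw `demo` text misses it both with and without `re.DOTALL`. The same function could be applied to `TyrKin` and `ABL1` before searching.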


In [4]:
matches = re.finditer(pattern, ABL1)
# finditer yields every non-overlapping match; without re.DOTALL, "." does not cross line breaks

for match in matches:
    print(f'match {match.group()}\n')



In [5]:
matches = re.finditer(pattern, TyrKin)
# finditer yields every non-overlapping match; without re.DOTALL, "." does not cross line breaks

for match in matches:
    print(f'match {match.group()}\n')
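
The two `finditer` cells print nothing when there are no matches, which is easy to mistake for a cell that did not run. A small sketch of a more explicit report follows: it collects each match with its residue positions and prints a message when the list is empty. The helper `find_motif` and the sequence `demo_seq` are hypothetical and used only to show the output format.

```python
import re

def find_motif(seq, pattern):
    """Return (start, end, text) for each non-overlapping match, with 1-based residue positions."""
    return [(m.start() + 1, m.end(), m.group()) for m in re.finditer(pattern, seq)]

# Hypothetical sequence containing two occurrences of the motif.
demo_seq = 'MKGAGTTGCAAAEKKGSGLLGNDPP'

hits = find_motif(demo_seq, r'G.G.{2}G[NC].{0,5}[DE]')
for start, end, text in hits:
    print(f'residues {start}-{end}: {text}')
if not hits:
    print('no matches')
```

Because `.{0,5}` is greedy, each match extends to the furthest D or E that lies within five residues of the N or C.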

Below is the corrected pattern provided after the lecture on 3/26.

In [11]:
# GXXXAGG is the corrected pattern
# Defined as a regex in the same manner as above: G, any three residues, A, then two Gs
corrPattern = r'G...AG{2}'
match = re.search(corrPattern, TyrKin)
if match:
    print('There is a match')
else:
    print('There are no matches')

match = re.search(corrPattern, ABL1)
if match:
    print('There is a match')
else:
    print('There are no matches')

# There still do not appear to be any matches.

There are no matches
There are no matches
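
When a pattern finds nothing, one way to see where it fails is to test it piece by piece. The sketch below, offered as a diagnostic under stated assumptions rather than as part of the assignment, splits the original motif into components and reports the longest leading portion that still matches. The function `longest_matching_prefix` and the split in `motif_parts` are hypothetical; the fragment is the first sequence line of the 3QRK chain A record defined earlier.

```python
import re

def longest_matching_prefix(parts, seq):
    """Return the longest leading run of regex parts that still matches seq."""
    matched = []
    for i in range(1, len(parts) + 1):
        candidate = ''.join(parts[:i])
        if re.search(candidate, seq):
            matched = parts[:i]
        else:
            break  # adding this part made the pattern fail
    return ''.join(matched)

# The original motif GXGXXG(N/C).....(D/E) split into its components.
motif_parts = ['G.G.{2}G', '[NC]', '.{0,5}', '[DE]']

# First sequence line of the ABL1 3QRK chain A record.
abl1_fragment = 'TSMDPSSPNYDKWEMERTDITMKHKLGGGQYGEVYEGVWKKYSLTVAVKTLKEDTMEVEEFLKEAAVMKE'

print(longest_matching_prefix(motif_parts, abl1_fragment))
```

On this fragment the sketch reports `G.G.{2}G`: the glycine-rich `GXGXXG` stretch is present as `GGGQYG`, but the residue that follows is E rather than N or C, which is why the full original pattern fails here.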
