# Introduction to Python  

## Regular Expressions Tutorial

Inspired by these sources: [1](https://github.com/KennyMiyasato/regex_blog_post/blob/master/regex_blog_post.ipynb), [2](https://luca-pessina.medium.com/python-regular-expressions-in-5-minutes-ecc8b6624308), [3](https://developers.google.com/edu/python/regular-expressions)  
Tables from: [regexr.com](https://regexr.com/)

#### Regular expressions are a powerful language for matching text patterns. This page gives a basic introduction to regular expressions, sufficient for our Python exercises, and shows how they work in Python. The Python `re` module provides regular expression support.

#### In Python a regular expression search is typically written as:

> match = re.search(pattern, str)

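A minimal sketch of how the result of `re.search` is usually handled. The sample text and pattern are made up for illustration. `re.search` returns a match object on success and `None` otherwise, so the result is normally tested before it is used.

```python
import re

# Hypothetical sample text and pattern, chosen only for illustration.
text = "an example word:cat!!"
match = re.search(r"word:\w\w\w", text)

# re.search returns a match object, or None if nothing matched.
if match:
    found = match.group()  # the text matched by the whole pattern
    print("found", found)
else:
    found = None
    print("did not find")
```
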

### Importing modules

In [49]:
import re
import pandas as pd

### Defining sample strings 

In [46]:
lowercase_alphabet = "abcdefghijklmnopqrstuvwxyz"
uppercase_alphabet = lowercase_alphabet.upper()
numbers = "1234567890"
sentence = "The Quick Brown Fox Jumps Over The Lazy Dog"
website = "https://webmail2016.univie.ac.at/"
phone_numbers = """123-456-7890
                    987.654.321
                    234-567-8901
                    654.321.987
                    345-678-9012
                    321.654.978
                    456-789-0123
                """
special_characters = r"[\^$.|?*+()"

text_with_email = 'xyz alice-b@google.com purple monkey'

long_text = '''
Nicht alles, was Gold ist, funkelt
Nicht alles, was Gold ist, funkelt,
Nicht jeder, der wandert, verlorn,
Das Alte wird nicht verdunkelt,
Noch Wurzeln der Tiefe erfroren.
Aus Asche wird Feuer geschlagen,
Aus Schatten geht Licht hervor;
Heil wird geborstenes Schwert,
Und König, der die Krone verlor.
'''

### Match explicit character(s)
In order to match characters explicitly, all you need to do is type what you'd like to find.

In [12]:
re.findall("abc", lowercase_alphabet)

['abc']

In [13]:
re.findall("ABC", uppercase_alphabet)

['ABC']

In [14]:
re.findall("abc", uppercase_alphabet)

[]

### Match with special character(s)
To match any of the *special characters `[\^$.|?*+()`*, you must first put a backslash `\` in front of the character you'd like to select. Writing the pattern as a raw string (`r"..."`) keeps Python from interpreting the backslash itself.

In [15]:
re.findall(r"webmail2016\.univie\.ac\.at", website)

['webmail2016.univie.ac.at']

In [7]:
re.findall(r"\$", special_characters)

['$']

In [8]:
re.findall(r"\|", special_characters)

['|']
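
When a whole string should be matched literally, the standard-library function `re.escape` can add the backslashes for you. The sketch below reuses the sample strings defined earlier and is only one way to do this.

```python
import re

special_characters = r"[\^$.|?*+()"
website = "https://webmail2016.univie.ac.at/"

# re.escape puts a backslash in front of every character that has a
# special meaning in a pattern, so the result matches the text literally.
pattern = re.escape("webmail2016.univie.ac.at")
hosts = re.findall(pattern, website)
print(hosts)

# Escaping the whole sample string lets it match itself exactly.
literal = re.findall(re.escape(special_characters), special_characters)
print(literal)
```
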

### Match by pattern
There are many ways to match a pattern. Regex has its own syntax, so we can pick and choose how we want our patterns to look.

### Character Classes
| Class | Explanation |
|---|---|
| . | any character except newline |
| \w \d \s | word, digit, whitespace |
| \W \D \S | not word, not digit, not whitespace |
| [abc] | any of a, b, or c |
| [^abc] | not a, b, or c |
| [a-g] | characters between a & g |

`[0-9]` is not always equivalent to `\d`.  
In Python 3, `[0-9]` matches only the characters 0123456789, while `\d` matches `[0-9]` and other digit characters, for example the Eastern Arabic numerals ٠١٢٣٤٥٦٧٨٩.
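
The difference can be checked with a short sketch. The test string is made up for this example. The `re.ASCII` flag is part of the standard `re` module and restricts `\d` to `0-9` again.

```python
import re

# Western digits followed by Eastern Arabic digits (U+0661 to U+0663).
mixed = "123 \u0661\u0662\u0663"

ascii_only = re.findall(r"[0-9]+", mixed)       # only 0-9
unicode_digits = re.findall(r"\d+", mixed)      # any Unicode decimal digit
ascii_flag = re.findall(r"\d+", mixed, flags=re.ASCII)  # \d limited to 0-9

print(ascii_only, unicode_digits, ascii_flag)
```
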

### Quantifiers & Alternation
| Class | Explanation |
|---|---|
| a* a+ a? | 0 or more, 1 or more, 0 or 1 |
| a{5} a{2,} | exactly five, two or more |
| a{1,3} | between one & three |
| a+? a{2,}? | match as few as possible |
| ab\|cd | match ab or cd |


In [9]:
re.findall(r"\w{1,}", sentence)

['The', 'Quick', 'Brown', 'Fox', 'Jumps', 'Over', 'The', 'Lazy', 'Dog']

In [20]:
re.findall(r"\W+", sentence)

[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']

In [23]:
re.findall("[The]+", sentence)

['The', 'e', 'The']

In [19]:
re.findall("[^TL]{1,}", sentence)

['he Quick Brown Fox Jumps Over ', 'he ', 'azy Dog']

In [25]:
re.findall("[a-r]{1,}", sentence)

['he', 'ick', 'ro', 'n', 'o', 'mp', 'er', 'he', 'a', 'og']
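
The table also lists lazy quantifiers and alternation, which the cells above do not exercise. A small sketch on the same `sentence` shows the difference between greedy and lazy matching:

```python
import re

sentence = "The Quick Brown Fox Jumps Over The Lazy Dog"

# Greedy: .+ grabs as much as possible, up to the last 'e' in the string.
greedy = re.findall(r"T.+e", sentence)

# Lazy: .+? stops at the first 'e' that completes a match.
lazy = re.findall(r"T.+?e", sentence)

# Alternation: match either word.
either = re.findall(r"Fox|Dog", sentence)

print(greedy)
print(lazy)
print(either)
```
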

### Anchors
| Class | Explanation |
|---|---|
| ^abc$ | start / end of the string |
| \b \B | word boundary, not a word boundary |

In [95]:
re.findall("^[Te]", sentence)

['T']

In [100]:
re.findall(r"[a-z,\.]{1,}$", long_text)

['verlor.']
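
The word-boundary anchor `\b` has no example above, so here is a brief sketch. The second test string is invented for illustration.

```python
import re

sentence = "The Quick Brown Fox Jumps Over The Lazy Dog"

# \b matches the empty position between a word character and a non-word one.
initials = re.findall(r"\b\w", sentence)      # first letter of every word
last_letters = re.findall(r"\w\b", sentence)  # last letter of every word

# Hypothetical text: 'he' occurs inside several longer words.
text = "he said the hero helped her"
anywhere = re.findall(r"he", text)
whole_word = re.findall(r"\bhe\b", text)

print(initials)
print(last_letters)
print(len(anywhere), whole_word)
```
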

### Escaped Characters
| Class | Explanation |
|---|---|
| \\. \\* \\\ | escaped special characters |
| \\t \\n \\r | tab, linefeed, carriage return |

In [35]:
re.findall(r"\d{3}\-\d{3}\-\d{4}", phone_numbers)

['123-456-7890', '234-567-8901', '345-678-9012', '456-789-0123']

In [34]:
re.findall("\n.{5}", long_text)

['\nNicht',
 '\nNicht',
 '\nNicht',
 '\nDas A',
 '\nNoch ',
 '\nAus A',
 '\nAus S',
 '\nHeil ',
 '\nUnd K']

### Groups & Lookaround
| Class | Explanation |
|---|---|
| (abc) | capture group |
| \1 | backreference to group #1 |
| (?:abc) | non-capturing group |
| (?=abc) | positive lookahead |
| (?!abc) | negative lookahead |

In [47]:
match = re.search(r'([\w.-]+)@([\w.-]+)', text_with_email)
if match:
    print(match.group())   ## 'alice-b@google.com' (the whole match)
    print(match.group(1))  ## 'alice-b' (the username, group 1)
    print(match.group(2))  ## 'google.com' (the host, group 2)

alice-b@google.com
alice-b
google.com


In [90]:
my_string = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
tuples = re.findall(r'([\w\.-]+)@([\w\.-]+)', my_string)
print(tuples)  ## [('alice', 'google.com'), ('bob', 'abc.com')]
for pair in tuples:  ## 'pair' avoids shadowing the built-in name tuple
    print(pair[0])  ## username
    print(pair[1])  ## host

[('alice', 'google.com'), ('bob', 'abc.com')]
alice
google.com
bob
abc.com
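
The remaining rows of the table (backreference, non-capturing group, lookahead) can be sketched as follows. All sample strings except `my_string` are made up for this example.

```python
import re

# Backreference: \1 must repeat exactly what group 1 captured,
# so this pattern finds doubled words.
doubled = re.findall(r"\b(\w+) \1\b", "it was was a long long day")

# Non-capturing group: groups 'ab' for the quantifier without capturing it,
# so findall returns the whole match.
repeated = re.findall(r"(?:ab)+", "ababab cd ab")

# Positive lookahead: match a word only if '@' follows; '@' is not consumed.
my_string = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
usernames = re.findall(r"\w+(?=@)", my_string)

# Negative lookahead: numbers that are NOT followed by ' USD'.
prices = "100 USD, 200 EUR, 300 USD"
not_usd = re.findall(r"\d+\b(?! USD)", prices)

print(doubled, repeated, usernames, not_usd)
```
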


### Substituting characters

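`re.sub(pattern, repl, string)` returns a copy of `string` in which every match of `pattern` is replaced by `repl`. For worked examples see source [2](https://luca-pessina.medium.com/python-regular-expressions-in-5-minutes-ecc8b6624308).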
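
A minimal sketch of substitution with `re.sub`, using shortened versions of the sample strings from earlier sections. The replacement texts are arbitrary choices for illustration.

```python
import re

phone_numbers = "123-456-7890 987.654.321 234-567-8901"

# Replace every dot or hyphen with a space.
normalised = re.sub(r"[.-]", " ", phone_numbers)

text = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'

# Captured groups can be reused in the replacement as \1, \2, ...
spelled_out = re.sub(r"([\w.-]+)@([\w.-]+)", r"\1 at \2", text)

# Hide the username part of every address.
hidden = re.sub(r"[\w.-]+@", "***@", text)

print(normalised)
print(spelled_out)
print(hidden)
```
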

### Regular Expressions with Pandas

In [57]:
fragebogen = pd.read_csv('../Data/CSV/fragebogen.csv', names=["id",
                                                              "nummer", 
                                                              "titel", 
                                                              "schlagwoerter", 
                                                              "erscheinungsjahr",
                                                              "autoren", 
                                                              "originaldaten",
                                                              "anmerkung",
                                                              "freigabe",
                                                              "checked",
                                                              "wordleiste",
                                                              "druck",
                                                              "online",
                                                              "publiziert",
                                                              "fragebogen_typ_id",])
fragebogen.drop(["id","schlagwoerter","erscheinungsjahr","autoren","originaldaten","anmerkung","freigabe",
                 "checked","wordleiste","druck","online","publiziert","fragebogen_typ_id",], inplace=True, axis=1)

fragebogen.set_index("nummer", drop=True, inplace=True)
fragebogen = fragebogen[fragebogen.titel.str.startswith('Fragebogen')]

In [58]:
fragebogen.info()

<class 'pandas.core.frame.DataFrame'>
Index: 109 entries, 1 to 109
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   titel   109 non-null    object
dtypes: object(1)
memory usage: 1.7+ KB


In [59]:
fragebogen.head(20)

| nummer | titel |
|---|---|
| 1 | Fragebogen 1: Kopf (1) |
| 2 | Fragebogen 2: Die Osterwoche (1) |
| 3 | Fragebogen 3: Die Osterwoche (2) |
| 4 | Fragebogen 4: Kopf (2) |
| 5 | Fragebogen 5: Zeit zwischen Ostern und Fronlei... |
| 6 | Fragebogen 6: Menschl. Haar und Bart (= Kopf 3) |
| 7 | Fragebogen 7: Hochzeit (1) |
| 8 | Fragebogen 8: Hochzeit (2) |
| 9 | Fragebogen 9: Hochzeit (3) |
| 10 | Fragebogen 10: Hochzeit (4) |


In [60]:
regex1 = r'([Fragebon]+)\s{1}([0-9]+)[:]{1}([,A-ZÄÖÜa-zäöüß0-9.\s]+)[,\s]*([=\-\(\)\sA-ZÄÖÜa-zäöüß0-9]*)'

fragebogen.titel.str.extract(regex1).head()

| nummer | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 1 | Fragebogen | 1 | Kopf | (1) |
| 2 | Fragebogen | 2 | Die Osterwoche | (1) |
| 3 | Fragebogen | 3 | Die Osterwoche | (2) |
| 4 | Fragebogen | 4 | Kopf | (2) |
| 5 | Fragebogen | 5 | Zeit zwischen Ostern und Fronleichnam | |


In [62]:
parts = fragebogen.titel.str.extract(regex1)  # run the regex once and reuse the result
fragebogen.loc[:,'fragebogen_num'] = parts[1].str.strip()
fragebogen.loc[:,'fragebogen_headwords'] = parts[2].str.strip()
fragebogen.loc[:,'fragebogen_series'] = parts[3].str.strip()

In [63]:
fragebogen.head(20)

| nummer | titel | fragebogen_num | fragebogen_headwords | fragebogen_series |
|---|---|---|---|---|
| 1 | Fragebogen 1: Kopf (1) | 1 | Kopf | (1) |
| 2 | Fragebogen 2: Die Osterwoche (1) | 2 | Die Osterwoche | (1) |
| 3 | Fragebogen 3: Die Osterwoche (2) | 3 | Die Osterwoche | (2) |
| 4 | Fragebogen 4: Kopf (2) | 4 | Kopf | (2) |
| 5 | Fragebogen 5: Zeit zwischen Ostern und Fronlei... | 5 | Zeit zwischen Ostern und Fronleichnam | |
| 6 | Fragebogen 6: Menschl. Haar und Bart (= Kopf 3) | 6 | Menschl. Haar und Bart | (= Kopf 3) |
| 7 | Fragebogen 7: Hochzeit (1) | 7 | Hochzeit | (1) |
| 8 | Fragebogen 8: Hochzeit (2) | 8 | Hochzeit | (2) |
| 9 | Fragebogen 9: Hochzeit (3) | 9 | Hochzeit | (3) |
| 10 | Fragebogen 10: Hochzeit (4) | 10 | Hochzeit | (4) |


## Exercises

### Twitter:
    
#### On Twitter, a username is valid if it:

+ Starts with an `@`, so in regex `^@`.
+ Contains only alphanumeric characters (letters A-Z, numbers 0-9).
    + In this case we can use `[A-Za-z0-9]`. The shorthand `\w` is close, but it also matches the underscore (and, in Python 3, non-ASCII letters), and it already includes the digits, so `[\w\d]` is the same as `\w`.
    + We use the brackets because we want to apply a quantifier to the whole set.
+ Is between 4 and 15 characters long, so `{4,15}`.

You can use `re.match()` or `re.search()` to test a string against this pattern. The cells below load a sample of tweets on which the selection with the regex can be tried.
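
The rules listed above can be combined into one pattern, as in the sketch below. It follows the rules exactly as stated in this tutorial. The candidate handles are hypothetical, and `re.fullmatch` is used here instead of `^...$` with `re.match` as one possible design choice.

```python
import re

# '@', then 4 to 15 letters or digits.
# re.fullmatch requires the whole string to match, so ^ and $ are not needed.
username_pattern = re.compile(r"@[A-Za-z0-9]{4,15}")

# Hypothetical handles, made up for this example.
candidates = [
    "@NancyLeeGrahn",
    "@abc",                         # too short
    "NoAtSign",                     # missing '@'
    "@way_too_long_username_here",  # underscore and too long
    "@TJMShow",
]

valid = [name for name in candidates if username_pattern.fullmatch(name)]
print(valid)
```
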

In [66]:
import sqlite3
conn = sqlite3.connect('../Data/tweet_database.sqlite')
cur = conn.cursor()
#df = pd.read_sql(conn, text)

In [73]:
q1 = "SELECT name FROM sqlite_master WHERE type IN ('table','view') AND name NOT LIKE 'sqlite_%' ORDER BY 1"
cur.execute(q1)
cur.fetchall()

[('Sentiment',)]

In [79]:
q2 = "SELECT * from Sentiment limit 1"
cur.execute(q2)
print([f[0] for f in cur.description])

['id', 'candidate', 'candidate_confidence', 'relevant_yn', 'relevant_yn_confidence', 'sentiment', 'sentiment_confidence', 'subject_matter', 'subject_matter_confidence', 'candidate_gold', 'name', 'relevant_yn_gold', 'retweet_count', 'sentiment_gold', 'subject_matter_gold', 'text', 'tweet_coord', 'tweet_created', 'tweet_id', 'tweet_location', 'user_timezone']


In [80]:
cur.fetchall()

[(1,
  'No candidate mentioned',
  1,
  'yes',
  1,
  'Neutral',
  0.6578,
  'None of the above',
  1,
  '',
  'I_Am_Kenzi',
  '',
  5,
  '',
  '',
  'RT @NancyLeeGrahn: How did everyone feel about the Climate Change question last night? Exactly. #GOPDebate',
  '',
  '2015-08-07 09:54:46 -0700',
  629697200650592256,
  '',
  'Quito')]

In [87]:
q3 = "SELECT text from Sentiment limit 50"
cur.execute(q3)
cur.fetchmany(3)

[('RT @NancyLeeGrahn: How did everyone feel about the Climate Change question last night? Exactly. #GOPDebate',),
 ("RT @ScottWalker: Didn't catch the full #GOPdebate last night. Here are some of Scott's best lines in 90 seconds. #Walker16 http://t.co/ZSfF…",),
 ('RT @TJMShow: No mention of Tamir Rice and the #GOPDebate was held in Cleveland? Wow.',)]

In [86]:
tweets = cur.execute(q3).fetchall()
tweets = [t[0] for t in tweets]
tweets

['RT @NancyLeeGrahn: How did everyone feel about the Climate Change question last night? Exactly. #GOPDebate',
 "RT @ScottWalker: Didn't catch the full #GOPdebate last night. Here are some of Scott's best lines in 90 seconds. #Walker16 http://t.co/ZSfF…",
 'RT @TJMShow: No mention of Tamir Rice and the #GOPDebate was held in Cleveland? Wow.',
 "RT @RobGeorge: That Carly Fiorina is trending -- hours after HER debate -- above any of the men in just-completed #GOPdebate says she's on …",
 'RT @DanScavino: #GOPDebate w/ @realDonaldTrump delivered the highest ratings in the history of presidential debates. #Trump2016 http://t.co…',
 'RT @GregAbbott_TX: @TedCruz: "On my first day I will rescind every illegal executive action taken by Barack Obama." #GOPDebate @FoxNews',
 'RT @warriorwoman91: I liked her and was happy when I heard she was going to be the moderator. Not anymore. #GOPDebate @megynkelly  https://…',
 'Going on #MSNBC Live with @ThomasARoberts around 2 PM ET.  #GOPDebate',
 'Deer

### Extract:
+ mentions
+ URLs
+ Hashtags
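
One possible starting point for this exercise, shown on three of the tweets printed above rather than on the database. The patterns are deliberately simple assumptions (a mention is `@` plus word characters, a URL is `http://` or `https://` up to the next whitespace) and will not cover every case.

```python
import re

# Three tweets copied from the output above.
tweets = [
    'RT @NancyLeeGrahn: How did everyone feel about the Climate Change question last night? Exactly. #GOPDebate',
    "RT @ScottWalker: Didn't catch the full #GOPdebate last night. Here are some of Scott's best lines in 90 seconds. #Walker16 http://t.co/ZSfF…",
    'Going on #MSNBC Live with @ThomasARoberts around 2 PM ET.  #GOPDebate',
]

# Simplified patterns, assumed for this sketch.
mention_re = re.compile(r"@\w+")
hashtag_re = re.compile(r"#\w+")
url_re = re.compile(r"https?://\S+")

# One list of matches per tweet.
mentions = [mention_re.findall(t) for t in tweets]
hashtags = [hashtag_re.findall(t) for t in tweets]
urls = [url_re.findall(t) for t in tweets]

print(mentions)
print(hashtags)
print(urls)
```
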