# Lecture 7 - Combining several data files using different methods. Various transformations on columns. Manipulating text data.

Much of the programming work in data analysis and modeling is spent on data preparation:
loading, cleaning, transforming, and rearranging. The way data is stored in files or
databases is often not the way a data processing application needs it. pandas, together with the Python standard library, provides a high-level, flexible, and high-performance set of core manipulations
and algorithms for wrangling data into the right form.

### Combining and Merging Data Sets
Data contained in pandas objects can be combined together in a number of built-in
ways:

• pandas.merge connects rows in DataFrames based on one or more keys. This will
be familiar to users of SQL or other relational databases, as it implements database
join operations.

• DataFrame.join combines the columns of two DataFrames, matching rows on the index by default.

• pandas.concat glues or stacks together objects along an axis.

In [2]:
import pandas as pd

In [3]:
df1 = pd.DataFrame({'ID':[1,2,3,5,9],
                    'Col_1':[1,2,3,4,5],
                    'Col_2':[6,7,8,9,10],
                    'Col_3':[11,12,13,14,15],
                    'Col_4':['apple','orange','banana','strawberry','raspberry']
                    })

df2 = pd.DataFrame({'ID':[1,1,3,5],
                    'Col_A':[8,9,10,11],
                    'Col_B':[12,13,15,17],
                    'Col_4':['apple','orange','banana','kiwi']
                   })

In [4]:
df1

Unnamed: 0,ID,Col_1,Col_2,Col_3,Col_4
0,1,1,6,11,apple
1,2,2,7,12,orange
2,3,3,8,13,banana
3,5,4,9,14,strawberry
4,9,5,10,15,raspberry


In [5]:
df2

Unnamed: 0,ID,Col_A,Col_B,Col_4
0,1,8,12,apple
1,1,9,13,orange
2,3,10,15,banana
3,5,11,17,kiwi


# pd.merge() - Database-style DataFrame Merges
* For combining data on common columns
* The most flexible, but also the most complex, of the methods we'll discuss
* many-to-one and many-to-many joins are possible
* Side-by-side merge

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html

In [51]:
# merge needs to know which column(s) to merge on; with no `on` argument it uses every column name the two frames share
# merging adds columns: the two frames are placed side by side
# only rows whose keys appear in both frames are kept (inner join, the default)
# here df1 is the left frame and df2 is the right frame (swap the arguments to change that)
# passing left_on=df1.index, right_on=df2.index produces an extra key_0 column; left_index=True, right_index=True avoids it
# one-to-many: a key that appears once in df1 and several times in df2 has its df1 row repeated for each df2 match
inner = pd.merge(df1, df2)
inner # the shared columns must have compatible dtypes (e.g. not strings vs. integers), otherwise merge raises an error

Unnamed: 0,ID,Col_1,Col_2,Col_3,Col_4,Col_A,Col_B
0,1,1,6,11,apple,8,12
1,3,3,8,13,banana,10,15


In [7]:
pd.merge(df1, df2, on='ID')

Unnamed: 0,ID,Col_1,Col_2,Col_3,Col_4_x,Col_A,Col_B,Col_4_y
0,1,1,6,11,apple,8,12,apple
1,1,1,6,11,apple,9,13,orange
2,3,3,8,13,banana,10,15,banana
3,5,4,9,14,strawberry,11,17,kiwi
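The output above repeats the `ID` 1 row of df1 because df2 contains two rows with that key. As a minimal sketch (the frames `one` and `many` are made-up stand-ins, not the lecture's data), the `validate` argument of `pd.merge` can be used to state which kind of relationship you expect, so that an unexpected duplicate key raises an error instead of silently multiplying rows:

```python
import pandas as pd

# Hypothetical toy frames: IDs are unique in `one`, repeated in `many`
one = pd.DataFrame({"ID": [1, 2, 3], "name": ["x", "y", "z"]})
many = pd.DataFrame({"ID": [1, 1, 3], "score": [8, 9, 10]})

# one_to_many: keys must be unique in the left frame only, so this passes
merged = pd.merge(one, many, on="ID", validate="one_to_many")
print(merged)

# one_to_one fails here because ID 1 appears twice in `many`
try:
    pd.merge(one, many, on="ID", validate="one_to_one")
    one_to_one_ok = True
except pd.errors.MergeError as exc:
    one_to_one_ok = False
    print("MergeError:", exc)
```

Whether to add such a check is a design choice; it costs a uniqueness test on the keys but catches duplicated rows early.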


In [8]:
# this is the same as the first merge because we are specifying both shared columns
pd.merge(df1, df2, on=['ID','Col_4'])

Unnamed: 0,ID,Col_1,Col_2,Col_3,Col_4,Col_A,Col_B
0,1,1,6,11,apple,8,12
1,3,3,8,13,banana,10,15


In [9]:
# If the column names are different in each object, you can specify them separately:
pd.merge(df1, df2, left_on='Col_2', right_on='Col_A')

Unnamed: 0,ID_x,Col_1,Col_2,Col_3,Col_4_x,ID_y,Col_A,Col_B,Col_4_y
0,3,3,8,13,banana,1,8,12,apple
1,5,4,9,14,strawberry,1,9,13,orange
2,9,5,10,15,raspberry,3,10,15,banana


In [10]:
# Suffixes & different columns 
pd.merge(df1, df2, suffixes=['_l','_r'], left_on='Col_2', right_on='Col_A')

Unnamed: 0,ID_l,Col_1,Col_2,Col_3,Col_4_l,ID_r,Col_A,Col_B,Col_4_r
0,3,3,8,13,banana,1,8,12,apple
1,5,4,9,14,strawberry,1,9,13,orange
2,9,5,10,15,raspberry,3,10,15,banana


In [11]:
df1

Unnamed: 0,ID,Col_1,Col_2,Col_3,Col_4
0,1,1,6,11,apple
1,2,2,7,12,orange
2,3,3,8,13,banana
3,5,4,9,14,strawberry
4,9,5,10,15,raspberry


In [12]:
df2

Unnamed: 0,ID,Col_A,Col_B,Col_4
0,1,8,12,apple
1,1,9,13,orange
2,3,10,15,banana
3,5,11,17,kiwi


In [13]:
# indexes
pd.merge(df1, df2, suffixes=['_l','_r'], left_index=True, right_index=True)

Unnamed: 0,ID_l,Col_1,Col_2,Col_3,Col_4_l,ID_r,Col_A,Col_B,Col_4_r
0,1,1,6,11,apple,1,8,12,apple
1,2,2,7,12,orange,1,9,13,orange
2,3,3,8,13,banana,3,10,15,banana
3,5,4,9,14,strawberry,5,11,17,kiwi


#### INNER
In an inner join, you will lose rows that don’t have a match in the other DataFrame’s key column. <br>
#### OUTER
In an outer join (also known as a full outer join), all rows from both DataFrames will be present in the new DataFrame.<br>
#### LEFT
Using a left outer join will leave your new merged DataFrame with all rows from the left DataFrame, while discarding rows from the right DataFrame that don’t have a match in the key column of the left DataFrame.
<br>
#### RIGHT
The right join (or right outer join) is the mirror-image version of the left join. With this join, all rows from the right DataFrame will be retained, while rows in the left DataFrame without a match in the key column of the right DataFrame will be discarded.
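
The four join types described above can be compared on a small example. This is only a sketch with made-up frames (`left` and `right` are hypothetical names); it counts the rows each `how` option returns and uses the `indicator=True` argument of `pd.merge` to show which side each row of an outer join came from:

```python
import pandas as pd

# Hypothetical toy frames sharing the keys "b" and "c"
left = pd.DataFrame({"key": ["a", "b", "c"], "left_val": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "right_val": [20, 30, 40]})

# Number of rows returned by each join type on the same key
row_counts = {
    how: len(pd.merge(left, right, on="key", how=how))
    for how in ["inner", "outer", "left", "right"]
}
print(row_counts)

# indicator=True adds a _merge column: left_only, right_only or both
outer = pd.merge(left, right, on="key", how="outer", indicator=True)
print(outer[["key", "_merge"]])
```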


In [14]:
# Changing the type of merge with `how`:
# 'outer', 'left', and 'right' are also available.
# An outer join keeps all rows from both frames and fills the gaps with NaN.
pd.merge(df1, df2, on='Col_4', how='outer', suffixes=['_l','_r'])

Unnamed: 0,ID_l,Col_1,Col_2,Col_3,Col_4,ID_r,Col_A,Col_B
0,1.0,1.0,6.0,11.0,apple,1.0,8.0,12.0
1,2.0,2.0,7.0,12.0,orange,1.0,9.0,13.0
2,3.0,3.0,8.0,13.0,banana,3.0,10.0,15.0
3,5.0,4.0,9.0,14.0,strawberry,,,
4,9.0,5.0,10.0,15.0,raspberry,,,
5,,,,,kiwi,5.0,11.0,17.0


In [15]:
pd.merge(df1, df2, on='Col_4', how='left', suffixes=['_l','_r'],)

Unnamed: 0,ID_l,Col_1,Col_2,Col_3,Col_4,ID_r,Col_A,Col_B
0,1,1,6,11,apple,1.0,8.0,12.0
1,2,2,7,12,orange,1.0,9.0,13.0
2,3,3,8,13,banana,3.0,10.0,15.0
3,5,4,9,14,strawberry,,,
4,9,5,10,15,raspberry,,,


In [16]:
pd.merge(df1, df2, on='Col_4', how='right', suffixes=['_l','_r'],)

Unnamed: 0,ID_l,Col_1,Col_2,Col_3,Col_4,ID_r,Col_A,Col_B
0,1.0,1.0,6.0,11.0,apple,1,8,12
1,2.0,2.0,7.0,12.0,orange,1,9,13
2,3.0,3.0,8.0,13.0,banana,3,10,15
3,,,,,kiwi,5,11,17


In [17]:
# We get a KeyError because Col_1 is not present in both DataFrames
pd.merge(df1, df2, on = 'Col_1' )    

KeyError: 'Col_1'

# df.join()
* Anything you can do with .join you can do with .merge

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html

In [21]:
# the default join type is left
# `on` is optional, but it is clearer to always state it explicitly
# note: on='ID' matches df1['ID'] against the index of df2 (0-3), not against df2['ID']
df1.join(df2, on='ID', lsuffix='_l',rsuffix='_r')

Unnamed: 0,ID_l,Col_1,Col_2,Col_3,Col_4_l,ID_r,Col_A,Col_B,Col_4_r
0,1,1,6,11,apple,1.0,9.0,13.0,orange
1,2,2,7,12,orange,3.0,10.0,15.0,banana
2,3,3,8,13,banana,5.0,11.0,17.0,kiwi
3,5,4,9,14,strawberry,,,,
4,9,5,10,15,raspberry,,,,
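
In the output above, the `ID` values of df1 were matched against the row index of df2 rather than against its `ID` column, which is why the fruit names do not line up. A minimal sketch of the difference, using made-up frames (`people` and `scores` are hypothetical), and one possible way to get a key-based match by moving the key into the index of the right frame first:

```python
import pandas as pd

# Hypothetical toy frames; note that `scores` is not sorted by ID
people = pd.DataFrame({"ID": [1, 2, 3], "name": ["x", "y", "z"]})
scores = pd.DataFrame({"ID": [2, 3, 1], "score": [20, 30, 10]})

# on="ID" compares people["ID"] with the *index* of scores (0, 1, 2),
# so ID 1 picks up the row at index 1 and ID 3 finds no match at all
by_position = people.join(scores, on="ID", lsuffix="_l", rsuffix="_r")
print(by_position)

# With ID as the index of the right frame, rows are matched on the key itself
by_key = people.join(scores.set_index("ID"), on="ID")
print(by_key)
```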


In [22]:
df1.join(df2, on='ID', how='inner', lsuffix='_l',rsuffix='_r')

Unnamed: 0,ID,ID_l,Col_1,Col_2,Col_3,Col_4_l,ID_r,Col_A,Col_B,Col_4_r
0,1,1,1,6,11,apple,1,9,13,orange
1,2,2,2,7,12,orange,3,10,15,banana
2,3,3,3,8,13,banana,5,11,17,kiwi


# pd.concat()
* good for side by side and stacking on top of each other


Concatenation is a bit different from the merging techniques you saw above. With merging, you can expect the resulting dataset to have rows from the parent datasets mixed in together, often based on some commonality. Depending on the type of merge, you might also lose rows that don’t have matches in the other dataset.

With concatenation, your datasets are just stitched together along an axis — either the row axis or column axis.
https://pandas.pydata.org/docs/reference/api/pandas.concat.html 
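
Because concatenation only stitches frames together, it is easy to lose track of which rows came from which input. As a small sketch (the frames `jan` and `feb` are made up for illustration), the `keys` argument of `pd.concat` labels each input with an extra index level:

```python
import pandas as pd

# Hypothetical monthly tables with identical columns
jan = pd.DataFrame({"ID": [1, 2], "sales": [10, 20]})
feb = pd.DataFrame({"ID": [1, 3], "sales": [15, 5]})

# keys= adds an outer index level naming the source of every row
stacked = pd.concat([jan, feb], keys=["jan", "feb"])
print(stacked)

# The label can then be used to pull one source back out
print(stacked.loc["feb"])
```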

In [23]:
# default axis is 0 (stack)
pd.concat([df1, df2])

Unnamed: 0,ID,Col_1,Col_2,Col_3,Col_4,Col_A,Col_B
0,1,1.0,6.0,11.0,apple,,
1,2,2.0,7.0,12.0,orange,,
2,3,3.0,8.0,13.0,banana,,
3,5,4.0,9.0,14.0,strawberry,,
4,9,5.0,10.0,15.0,raspberry,,
0,1,,,,apple,8.0,12.0
1,1,,,,orange,9.0,13.0
2,3,,,,banana,10.0,15.0
3,5,,,,kiwi,11.0,17.0


In [24]:
# reset the index
pd.concat([df1, df2], ignore_index=True)

Unnamed: 0,ID,Col_1,Col_2,Col_3,Col_4,Col_A,Col_B
0,1,1.0,6.0,11.0,apple,,
1,2,2.0,7.0,12.0,orange,,
2,3,3.0,8.0,13.0,banana,,
3,5,4.0,9.0,14.0,strawberry,,
4,9,5.0,10.0,15.0,raspberry,,
5,1,,,,apple,8.0,12.0
6,1,,,,orange,9.0,13.0
7,3,,,,banana,10.0,15.0
8,5,,,,kiwi,11.0,17.0


In [25]:
# concatenate side by side by specifying axis=1
pd.concat([df1, df2], axis=1)

# Note: This example assumes that your indices are the same between datasets. 
#     If they are different while concatenating along columns (axis 1), 
#     then by default the extra indices (rows) will also be added, and NaN 
#     values will be filled in as applicable.
# so no data is lost by default

Unnamed: 0,ID,Col_1,Col_2,Col_3,Col_4,ID.1,Col_A,Col_B,Col_4.1
0,1,1,6,11,apple,1.0,8.0,12.0,apple
1,2,2,7,12,orange,1.0,9.0,13.0,orange
2,3,3,8,13,banana,3.0,10.0,15.0,banana
3,5,4,9,14,strawberry,5.0,11.0,17.0,kiwi
4,9,5,10,15,raspberry,,,,


In [135]:
# reset the index
pd.concat([df1, df2], axis=1, ignore_index=True)

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,1,1,6,11,apple,1.0,8.0,12.0,apple
1,2,2,7,12,orange,1.0,9.0,13.0,orange
2,3,3,8,13,banana,3.0,10.0,15.0,banana
3,5,4,9,14,strawberry,5.0,11.0,17.0,kiwi
4,9,5,10,15,raspberry,,,,


In [134]:
# the default join is outer; with join='inner' only index labels present in
# both frames are kept, so row 4 of df1 (which has no match in df2) is dropped
pd.concat([df1, df2], axis=1, join = 'inner')

Unnamed: 0,ID,Col_1,Col_2,Col_3,Col_4,ID.1,Col_A,Col_B,Col_4.1
0,1,1,6,11,apple,1,8,12,apple
1,2,2,7,12,orange,1,9,13,orange
2,3,3,8,13,banana,3,10,15,banana
3,5,4,9,14,strawberry,5,11,17,kiwi


In [201]:
pd.concat([df1, df2], axis=0, join = 'inner')

Unnamed: 0,ID,Col_4
0,1,apple
1,2,orange
2,3,banana
3,5,strawberry
4,9,raspberry
0,1,apple
1,1,orange
2,3,banana
3,5,kiwi


# df.append()

In [26]:
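# DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0;
# on current versions use pd.concat([df1, df2]) instead
df1.append(df2)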

Unnamed: 0,ID,Col_1,Col_2,Col_3,Col_4,Col_A,Col_B
0,1,1.0,6.0,11.0,apple,,
1,2,2.0,7.0,12.0,orange,,
2,3,3.0,8.0,13.0,banana,,
3,5,4.0,9.0,14.0,strawberry,,
4,9,5.0,10.0,15.0,raspberry,,
0,1,,,,apple,8.0,12.0
1,1,,,,orange,9.0,13.0
2,3,,,,banana,10.0,15.0
3,5,,,,kiwi,11.0,17.0


In [27]:
df1.append(df2, sort=True)

Unnamed: 0,Col_1,Col_2,Col_3,Col_4,Col_A,Col_B,ID
0,1.0,6.0,11.0,apple,,,1
1,2.0,7.0,12.0,orange,,,2
2,3.0,8.0,13.0,banana,,,3
3,4.0,9.0,14.0,strawberry,,,5
4,5.0,10.0,15.0,raspberry,,,9
0,,,,apple,8.0,12.0,1
1,,,,orange,9.0,13.0,1
2,,,,banana,10.0,15.0,3
3,,,,kiwi,11.0,17.0,5
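
On pandas versions where `DataFrame.append` is no longer available, the same stacking can be written with `pd.concat`. A minimal sketch with made-up frames (`top` and `bottom` are hypothetical), mirroring the `sort=True` call above:

```python
import pandas as pd

# Hypothetical toy frames with partly different columns
top = pd.DataFrame({"ID": [1, 2], "Col_1": [1, 2]})
bottom = pd.DataFrame({"ID": [3], "Col_A": [8]})

# Stacks the rows like top.append(bottom, sort=True) did:
# columns are unioned and sorted, missing cells become NaN
stacked = pd.concat([top, bottom], sort=True)
print(stacked)
```

As with `append`, the original row labels are kept; pass `ignore_index=True` to renumber them.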


## Class Exercise

In [65]:
import pandas as pd

In [66]:
df1 = pd.read_csv(r"list1.csv")
df1

Unnamed: 0,FellowshipID,FirstName,Skills
0,1001,Frodo,Hiking
1,1002,Samwise,Gardening
2,1003,Gandalf,Spells
3,1004,Pippin,Cycling


In [67]:
df2 = pd.read_csv(r"list2.csv")
df2

Unnamed: 0,FellowshipID,FirstName,Age
0,1001,Frodo,50
1,1002,Samwise,39
2,1006,Legolas,25
3,1007,Elrond,43
4,1008,Barromir,51


In [68]:
df1.merge(df2, how = 'inner', on = ['FellowshipID', 'FirstName'])

Unnamed: 0,FellowshipID,FirstName,Skills,Age
0,1001,Frodo,Hiking,50
1,1002,Samwise,Gardening,39


In [69]:
df1.merge(df2, how = 'outer')

Unnamed: 0,FellowshipID,FirstName,Skills,Age
0,1001,Frodo,Hiking,50.0
1,1002,Samwise,Gardening,39.0
2,1003,Gandalf,Spells,
3,1004,Pippin,Cycling,
4,1006,Legolas,,25.0
5,1007,Elrond,,43.0
6,1008,Barromir,,51.0


In [70]:
df1.merge(df2, how = 'left')

Unnamed: 0,FellowshipID,FirstName,Skills,Age
0,1001,Frodo,Hiking,50.0
1,1002,Samwise,Gardening,39.0
2,1003,Gandalf,Spells,
3,1004,Pippin,Cycling,


In [71]:
df1.merge(df2, how = 'right')

Unnamed: 0,FellowshipID,FirstName,Skills,Age
0,1001,Frodo,Hiking,50
1,1002,Samwise,Gardening,39
2,1006,Legolas,,25
3,1007,Elrond,,43
4,1008,Barromir,,51


In [72]:
df1.merge(df2, how = 'cross')

Unnamed: 0,FellowshipID_x,FirstName_x,Skills,FellowshipID_y,FirstName_y,Age
0,1001,Frodo,Hiking,1001,Frodo,50
1,1001,Frodo,Hiking,1002,Samwise,39
2,1001,Frodo,Hiking,1006,Legolas,25
3,1001,Frodo,Hiking,1007,Elrond,43
4,1001,Frodo,Hiking,1008,Barromir,51
5,1002,Samwise,Gardening,1001,Frodo,50
6,1002,Samwise,Gardening,1002,Samwise,39
7,1002,Samwise,Gardening,1006,Legolas,25
8,1002,Samwise,Gardening,1007,Elrond,43
9,1002,Samwise,Gardening,1008,Barromir,51


In [73]:
df1.join(df2, on = 'FellowshipID', how = 'outer', lsuffix = '_Left',rsuffix = '_Right')

Unnamed: 0,FellowshipID,FellowshipID_Left,FirstName_Left,Skills,FellowshipID_Right,FirstName_Right,Age
0.0,1001,1001.0,Frodo,Hiking,,,
1.0,1002,1002.0,Samwise,Gardening,,,
2.0,1003,1003.0,Gandalf,Spells,,,
3.0,1004,1004.0,Pippin,Cycling,,,
,0,,,,1001.0,Frodo,50.0
,1,,,,1002.0,Samwise,39.0
,2,,,,1006.0,Legolas,25.0
,3,,,,1007.0,Elrond,43.0
,4,,,,1008.0,Barromir,51.0


In [74]:
df4 = df1.set_index('FellowshipID').join(df2.set_index('FellowshipID'), lsuffix = '_Left',rsuffix = '_Right', how = 'outer')
df4

FellowshipID,FirstName_Left,Skills,FirstName_Right,Age
1001,Frodo,Hiking,Frodo,50.0
1002,Samwise,Gardening,Samwise,39.0
1003,Gandalf,Spells,,
1004,Pippin,Cycling,,
1006,,,Legolas,25.0
1007,,,Elrond,43.0
1008,,,Barromir,51.0


In [75]:
pd.concat([df1,df2], join = 'outer', axis = 1)

Unnamed: 0,FellowshipID,FirstName,Skills,FellowshipID.1,FirstName.1,Age
0,1001.0,Frodo,Hiking,1001,Frodo,50
1,1002.0,Samwise,Gardening,1002,Samwise,39
2,1003.0,Gandalf,Spells,1006,Legolas,25
3,1004.0,Pippin,Cycling,1007,Elrond,43
4,,,,1008,Barromir,51


In [76]:
# on pandas 2.0+ use pd.concat([df1, df2]) (DataFrame.append was removed)
df1.append(df2)

Unnamed: 0,FellowshipID,FirstName,Skills,Age
0,1001,Frodo,Hiking,
1,1002,Samwise,Gardening,
2,1003,Gandalf,Spells,
3,1004,Pippin,Cycling,
0,1001,Frodo,,50.0
1,1002,Samwise,,39.0
2,1006,Legolas,,25.0
3,1007,Elrond,,43.0
4,1008,Barromir,,51.0


## Rename Columns

In [101]:
df3 = df1.merge(df2, how='left')
df3

Unnamed: 0,FellowshipID,FirstName,Skills,Age
0,1001,Frodo,Hiking,50.0
1,1002,Samwise,Gardening,39.0
2,1003,Gandalf,Spells,
3,1004,Pippin,Cycling,


In [102]:
df3 = df3.rename({'FirstName': 'First_Name', 'Skills': 'Skill_type'}, axis=1)
# df3.set_index('FellowshipID', inplace=True)
df3

Unnamed: 0,FellowshipID,First_Name,Skill_type,Age
0,1001,Frodo,Hiking,50.0
1,1002,Samwise,Gardening,39.0
2,1003,Gandalf,Spells,
3,1004,Pippin,Cycling,


## Reverse column order

In [103]:
df3.loc[:, ::-1]


Unnamed: 0,Age,Skill_type,First_Name,FellowshipID
0,50.0,Hiking,Frodo,1001
1,39.0,Gardening,Samwise,1002
2,,Spells,Gandalf,1003
3,,Cycling,Pippin,1004


### Select columns by data type

In [104]:
df3.dtypes

FellowshipID      int64
First_Name       object
Skill_type       object
Age             float64
dtype: object

In [109]:
df3.select_dtypes(include='object')

Unnamed: 0,First_Name,Skill_type
0,Frodo,Hiking
1,Samwise,Gardening
2,Gandalf,Spells
3,Pippin,Cycling


In [110]:
df3.select_dtypes(exclude='object')

Unnamed: 0,FellowshipID,Age
0,1001,50.0
1,1002,39.0
2,1003,
3,1004,


In [111]:
df3.select_dtypes(include=['int64', 'float64'])

Unnamed: 0,FellowshipID,Age
0,1001,50.0
1,1002,39.0
2,1003,
3,1004,


## Turn numbers into categories 

In [115]:
df = pd.read_csv('Salaries.csv')
df.head()

Unnamed: 0,rank,discipline,phd,service,sex,salary
0,Prof,B,56,49,Male,186960
1,Prof,A,12,6,Male,93000
2,Prof,A,23,20,Male,110515
3,Prof,A,40,31,Male,131205
4,Prof,B,20,18,Male,104800


In [117]:
df['salary'].min()

57800

In [118]:
df['salary'].max()

186960

In [139]:
df['sal_categories'] = pd.cut(df['salary'], 
       bins=[int(df['salary'].min())-1, 100000, 150000, int(df['salary'].max())+1],
       labels=['low', 'medium', 'high'])

In [123]:
df.head(10)

Unnamed: 0,rank,discipline,phd,service,sex,salary,sal_categories
0,Prof,B,56,49,Male,186960,high
1,Prof,A,12,6,Male,93000,low
2,Prof,A,23,20,Male,110515,medium
3,Prof,A,40,31,Male,131205,medium
4,Prof,B,20,18,Male,104800,medium
5,Prof,A,20,20,Male,122400,medium
6,AssocProf,A,20,17,Male,81285,low
7,Prof,A,18,18,Male,126300,medium
8,Prof,A,29,19,Male,94350,low
9,Prof,A,51,51,Male,57800,low


In [138]:
df['sal_categories'].value_counts(normalize=True)

medium    0.474359
low       0.410256
high      0.115385
Name: sal_categories, dtype: float64
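
The `pd.cut` call above subtracts 1 from the minimum because bins are open on the left and closed on the right by default, so a value equal to the first edge would otherwise get no category. A short sketch of that behaviour on a handful of made-up salary values (the Series below is illustrative, not read from Salaries.csv), together with `include_lowest=True` as an alternative:

```python
import pandas as pd

# Illustrative values, including some that sit exactly on a bin edge
salaries = pd.Series([57800, 93000, 100000, 110515, 150000, 186960])

# Intervals are (57799, 100000], (100000, 150000], (150000, 186961]
labels = pd.cut(
    salaries,
    bins=[salaries.min() - 1, 100000, 150000, salaries.max() + 1],
    labels=["low", "medium", "high"],
)
print(labels.tolist())

# include_lowest=True closes the first interval on the left instead,
# so the exact minimum can be used as the first edge
same_labels = pd.cut(
    salaries,
    bins=[salaries.min(), 100000, 150000, salaries.max()],
    labels=["low", "medium", "high"],
    include_lowest=True,
)
print(same_labels.tolist())
```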

In [141]:
# pandas_profiling is a third-party package and must be installed first
# (pip install pandas-profiling); newer releases are published as ydata-profiling
import pandas_profiling

In [None]:
pandas_profiling.ProfileReport(df)

## RegEx in Pandas
Regular expressions provide a flexible way to search or match string patterns in text. A
single expression, commonly called a regex, is a string formed according to the regular
expression language. Python’s built-in re module is responsible for applying regular
expressions to strings.

The re module functions fall into three categories: pattern matching, substitution, and
splitting. Naturally these are all related; a regex describes a pattern to locate in the text,
which can then be used for many purposes. Let’s look at a simple example: suppose I
wanted to split a string with a variable number of whitespace characters (tabs, spaces,
and newlines). The regex describing one or more whitespace characters is `\s+`.
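
The whitespace example just described can be sketched as follows. The sample string is made up; the comparison with `str.split(" ")` is only there to show why a pattern is needed when the amount and kind of whitespace varies:

```python
import re

# Made-up sample with double spaces, a tab and a newline
text = "foo  bar\tbaz \n qux"

# \s+ treats any run of spaces, tabs or newlines as a single separator
parts = re.split(r"\s+", text)
print(parts)

# Splitting on a single literal space leaves empty strings and stray
# tab/newline characters in the result
naive = text.split(" ")
print(naive)
```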

match and search are closely related to findall. While findall returns all matches in a
string, search returns only the first match. More rigidly, match only matches at the
beginning of the string.
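
A small sketch of the difference between the three functions, on a made-up string; the pattern `\d+` (one or more digits) is just an illustrative choice:

```python
import re

text = "order 66 shipped, order 67 pending"
pattern = re.compile(r"\d+")

# findall: every non-overlapping match, returned as a list of strings
all_numbers = pattern.findall(text)

# search: the first match anywhere in the string (a Match object or None)
first = pattern.search(text)

# match: only succeeds if the pattern matches at the very start of the string,
# and this string starts with a letter
at_start = pattern.match(text)

print(all_numbers, first.group(), first.span(), at_start)
```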

# <center>REGEX</center> # 
<img src="https://www.oreilly.com/content/wp-content/uploads/sites/2/2019/06/email-regex_crop-ae942dc427c8cebd3a83c52d17389123.jpg" alt="regex" style="width: 250px;"/>

### 0. What is regex?
A compact language for describing text patterns, useful for finding or validating:
- emails
- phone numbers
- credit card info
- IDs
- dates
- any other kind of standardized info

### 1. Regex BASICS 📚
1.1. What <br />
1.2. Where <br />
1.3. How many  <br />
1.4. Brackets and groups <br/>


### 2. Regex in PYTHON 🐍 ####

    2.1. FINDING the pattern
        2.1.1. Search
        2.1.2. Match
        2.1.3. Find all


    2.2 WORKING with the pattern
        2.2.1. sub
        2.2.2. split


### 3. Regex and pandas  ###
3.1. Lambda & function<br />
3.2. Apply<br />
3.3. Hands on<br />


# 1. Regex basics

## 1.1. WHAT ##

### characters

##### capturing vs. escaping #####
##### * (need to be escaped: "\d")  #####

- `.`  Any Character Except New Line
- `\d` Digit (0-9)
- `\D` Not a Digit (0-9)
- `\w` Word Character (a-z, A-Z, 0-9, _)
- `\W` Not a Word Character
- `\s` Whitespace (space, tab, newline)
- `\S` Not Whitespace (space, tab, newline)


## 1.2. WHERE ##
### anchors & boundaries ###

- `\b` Word Boundary
- `\B` Not a Word Boundary
- `^`  Beginning of a String
- `$` End of a String



## 1.3. HOW MANY ##
### Quantifiers

- `*`       - 0 or More
- `+`       - 1 or More
- `?`       - 0 or One
- `{3}`     - Exact Number
- `{3,4}`   - Range of Numbers (Minimum, Maximum)

## 1.4. Brackets and groups ##


- `[]`      - Matches Characters in brackets
- `[^ ]`    - Matches Characters NOT in brackets
- `|`       - Either Or
- `( )`     - Group

Need to be escaped: . ^ $ * + ? { } [ ] \ | ( )
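
The building blocks listed above (character classes, anchors, quantifiers, brackets and groups) can be combined as in the following sketch. The sample strings are taken from the practice text used later in this lecture; the two patterns themselves are illustrative assumptions, not the only way to write them:

```python
import re

lines = [
    "321-555-4321",
    "123.555.1234",
    "123*555*1234",
    "Mr. Schafer",
    "Mrs. Robinson",
    "Ms Davis",
]

# ^ and $ anchor the whole string, \d{3} is exactly three digits,
# [-.] is a single dash or dot (so the * separator is rejected)
phone = re.compile(r"^\d{3}[-.]\d{3}[-.]\d{4}$")
phones = [s for s in lines if phone.search(s)]

# (Mr|Ms|Mrs) is a group of alternatives, \.? an optional escaped dot,
# \s one whitespace character, ([A-Z]\w*) a capitalised word captured as group 2
title = re.compile(r"^(Mr|Ms|Mrs)\.?\s([A-Z]\w*)$")
names = [m.group(2) for m in map(title.search, lines) if m]

print(phones)
print(names)
```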

### Sample Regexs ###

[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+

### 📚 Other resources

1️⃣ [The documentation](https://docs.python.org/3/howto/regex.html)<br />
2️⃣ [Simple CHEATSHEET](https://medium.com/factory-mind/regex-tutorial-a-simple-cheatsheet-by-examples-649dc1c3f285)  <br />
3️⃣ [Tutorial datacamp](https://www.datacamp.com/community/tutorials/python-regular-expression-tutorial)<br />
4️⃣ https://www.dataquest.io/blog/regular-expressions-data-scientists/ <br />
5️⃣ https://www.kaggle.com/rtatman/fraudulent-email-corpus

### Let's practice ❗️ ###

Remember to:
- Make sure you know what you need to match
- A pattern is matched against the whole content, not just line by line
- Make it usable for your goals: r'.*' matches practically anything, so try to make the pattern reasonably specific.

https://regexone.com/lesson/matching_characters?

In [167]:
import re
text = "foo bar\t baz \tqux"

In [168]:
pattern = re.compile(r'\s+')  # raw string, so that \s reaches the regex engine unchanged

re.split(pattern, text)
# re.split(r'\s+', text)

['foo', 'bar', 'baz', 'qux']

In [169]:
pattern.findall(text)

[' ', '\t ', ' \t']

In [171]:
text = """Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
regex = re.compile(pattern, flags=re.IGNORECASE)
regex.findall(text)

['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']

In [172]:
import re

text_to_search = r'''
abcdefghijklmnopqurtuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
1234567890

Ha HaHa

MetaCharacters (Need to be escaped):
. ^ $ * + ? { } [ ] \ | ( )

coreyms.com

321-555-4321
123.555.1234
123*555*1234
800-555-1234
900-555-1234

Mr. Schafer
Mr Smith
Ms Davis
Mrs. Robinson
Mr. T
'''

emails = '''
CoreyMSchafer@gmail.com
corey.schafer@university.edu
corey-321-schafer@my-work.net
'''


sentence = 'Start a sentence and then bring it to an end'

pattern = re.compile(r'start', re.I)

matches = pattern.search(sentence)

print(matches)

pattern = re.compile(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+')

matches = pattern.finditer(emails)

for match in matches:
    print(match)

<re.Match object; span=(0, 5), match='Start'>
<re.Match object; span=(1, 24), match='CoreyMSchafer@gmail.com'>
<re.Match object; span=(25, 53), match='corey.schafer@university.edu'>
<re.Match object; span=(54, 83), match='corey-321-schafer@my-work.net'>
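
The same patterns can be applied to a whole pandas column. The following is a minimal sketch under stated assumptions: the `contacts` frame and its `raw` column are made up for illustration, and the email pattern is the simplified one used above. It shows the vectorised `Series.str.contains` and `Series.str.extract` methods, plus `apply` with a lambda that calls `re` directly:

```python
import re
import pandas as pd

# Hypothetical frame with free-text contact lines
contacts = pd.DataFrame({
    "raw": [
        "Dave dave@google.com",
        "Steve steve@gmail.com",
        "Rob (no email)",
    ]
})

# str.contains returns a boolean Series: does the pattern occur in each row?
email_re = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}"
contacts["has_email"] = contacts["raw"].str.contains(email_re, regex=True)

# str.extract returns one column per group; named groups become column names,
# and rows without a match are filled with NaN
parts = contacts["raw"].str.extract(
    r"(?P<user>[A-Za-z0-9._%+-]+)@(?P<domain>[A-Za-z0-9.-]+\.[A-Za-z]{2,4})"
)
contacts = contacts.join(parts)

# apply with a lambda runs any Python function, here re.split, on each value
contacts["first_word"] = contacts["raw"].apply(lambda s: re.split(r"\s+", s)[0])

print(contacts)
```

The `.str` methods are usually the more convenient choice; `apply` with a lambda is the fallback when the logic does not fit a single pattern.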
