# Text mining

- Simple string methods
- Regular expressions

In [159]:
import pandas as pd
import numpy as np
import re

In [160]:
fut = pd.read_csv('../data/premier_league.csv')

In [161]:
fut.head()

Unnamed: 0,Club,Name,Age,Nationality,Position,Pos,MarketValue,ClubInvolved,CountryInvolved,Fee,Movement,Season,Window,League,Profile
0,Chelsea FC,Álvaro Morata,24,Spain,Centre-Forward,CF,£36.00m,Real Madrid,Spain,£59.40m,In,2017,s,premier-league,https://www.transfermarkt.co.uk/alvaro-morata/...
1,Chelsea FC,Tiemoué Bakayoko,22,France,Defensive Midfield,DM,£14.40m,AS Monaco,France,£36.00m,In,2017,s,premier-league,https://www.transfermarkt.co.uk/tiemoue-bakayo...
2,Chelsea FC,Danny Drinkwater,27,England,Central Midfield,CM,£8.10m,Leicester City,England,£34.11m,In,2017,s,premier-league,https://www.transfermarkt.co.uk/danny-drinkwat...
3,Chelsea FC,Antonio Rüdiger,24,Germany,Centre-Back,CB,£22.50m,AS Roma,Italy,£31.50m,In,2017,s,premier-league,https://www.transfermarkt.co.uk/antonio-rudige...
4,Chelsea FC,Davide Zappacosta,25,Italy,Right-Back,RB,£7.65m,Torino FC,Italy,£22.50m,In,2017,s,premier-league,https://www.transfermarkt.co.uk/davide-zappaco...


`lower()`, `upper()`, `title()`

In [162]:
fut['Name'].str.lower().head()

0        álvaro morata
1     tiemoué bakayoko
2     danny drinkwater
3      antonio rüdiger
4    davide zappacosta
Name: Name, dtype: object
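The other two methods in the heading behave the same way on a plain Python string. The following is a small sketch on a made-up value, including a known quirk of `title()`:

```python
name = "álvaro morata"

upper_name = name.upper()   # every cased character in upper case
title_name = name.title()   # first letter of each word in upper case
print(upper_name)  # ÁLVARO MORATA
print(title_name)  # Álvaro Morata

# title() starts a new "word" after any non-letter, including apostrophes
quirk = "they're bill's friends".title()
print(quirk)  # They'Re Bill'S Friends
```

On a pandas column the same methods are available through the `.str` accessor, as in the cell above.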

`strip()`

In [163]:
s = ' espacio              '
s.strip()

'espacio'
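`strip()` also has one-sided variants and accepts a set of characters to remove. A minimal sketch, using illustrative strings:

```python
s = '  espacio   '

left = s.lstrip()    # remove leading whitespace only
right = s.rstrip()   # remove trailing whitespace only
print(repr(left))    # 'espacio   '
print(repr(right))   # '  espacio'

# With an argument, strip() removes any of the given characters from both ends
value = '£36.00m'.strip('£m')
print(value)  # 36.00
```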

`split()`

In [164]:
# Tokenization
s = 'ciencia de redes'
s.split()

['ciencia', 'de', 'redes']
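`split()` also accepts an explicit separator and a maximum number of splits. The sketch below uses made-up strings to show both, plus the difference between splitting on any whitespace and splitting on a single space:

```python
row = 'Chelsea FC,Álvaro Morata,24'
fields = row.split(',')            # split on an explicit separator
print(fields)  # ['Chelsea FC', 'Álvaro Morata', '24']

# maxsplit limits the number of cuts, counted from the left
first, rest = 'Virgil van Dijk'.split(maxsplit=1)
print(first, '|', rest)  # Virgil | van Dijk

# No argument: runs of whitespace count as one separator
collapsed = 'a  b'.split()
# Explicit ' ': every single space is a separator, so empty strings appear
literal = 'a  b'.split(' ')
print(collapsed, literal)  # ['a', 'b'] ['a', '', 'b']
```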

`center()`, `rjust()`, `zfill()`

In [165]:
print(s.center(30, "*"))
print(s.zfill(30))
print(s.rjust(30))

*******ciencia de redes*******
00000000000000ciencia de redes
              ciencia de redes


`swapcase()`, `join()`

In [166]:
s = 'Ciencia De Redes'
s.swapcase()

'cIENCIA dE rEDES'

In [167]:
s = ['ciencia', 'de', 'redes']
' '.join(s)

'ciencia de redes'

# Regular expressions

- 1951: Stephen Cole Kleene introduced the concepts of regular languages and regular expressions.
- mid-1960s: Ken Thompson implemented pattern matching using Kleene's notation.
- Since then: Regexes (regular expressions) are ubiquitous in programming languages, text editors, etc.

## Some Regexes

- `.` | Matches any character except line terminators like `\n`
- `A|B` | Matches expression A or B. If A is matched first, B is left untried.
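- `\w` | Matches word characters: a-z, A-Z, 0-9, and the underscore `_`. For `str` patterns in Python 3 it also matches non-ASCII letters and digits (such as `á`) unless the `re.ASCII` flag is set.
- `\d` | Matches decimal digits, 0-9 (and other Unicode decimal digits unless `re.ASCII` is set).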
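The four tokens above can be tried out on short made-up strings. This sketch uses `re.search()` and `re.findall()` from the standard `re` module; `findall()` returns every non-overlapping match as a list:

```python
import re

# `.` matches any character except a newline
any_char = re.search(r'c.t', 'cat').group()
no_newline = re.search(r'c.t', 'c\nt')
print(any_char, no_newline)  # cat None

# `A|B` tries the alternatives from left to right
either = re.search(r'USD|MXN', '529MXN').group()
print(either)  # MXN

# `\w` matches letters, digits and the underscore, but not punctuation
word_chars = re.findall(r'\w', 'a_1!')
print(word_chars)  # ['a', '_', '1']

# `\d` matches one digit at a time
digits = re.findall(r'\d', 'Age: 24')
print(digits)  # ['2', '4']
```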

### The `re` module

```python
re.search(<regex>, <string>)
```

`re.search(<regex>, <string>)` scans `<string>` looking for the first location where the pattern `<regex>` matches. If a match is found, then `re.search()` returns a match object. Otherwise, it returns `None`.
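Because `re.search()` may return `None`, calling a method on its result without checking can raise `AttributeError`. A minimal sketch of a guarded lookup follows; `first_digit` is a hypothetical helper written only for this illustration:

```python
import re

def first_digit(text):
    """Return the first digit in `text`, or None when there is none."""
    m = re.search(r'\d', text)
    # A match object is truthy; None is falsy
    return m.group() if m else None

found = first_digit('$ 340USD')
missing = first_digit('no digits here')
print(found, missing)  # 3 None

# The match object also reports where the match was found
span = re.search(r'\d', '$ 340USD').span()
print(span)  # (2, 3)
```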

In [168]:
# Example: match the first digit
s = '$ 340USD'
re.search(r'\d', s)

<re.Match object; span=(2, 3), match='3'>

- `\d`: Digit
- `+`: Greedily matches the expression to its left 1 or more times.

In [169]:
# Match the literal dollar sign (escaped, because `$` is a metacharacter)
s = '$ 128USD'
re.search(r'\$', s)

<re.Match object; span=(0, 1), match='$'>

In [170]:
# Capture the digits, then the word characters that follow them
s = '$ 340USD'
r = re.search(r'(\d+)(\w+)', s)
print(r.groups())

('340', 'USD')


Groups

- `( )` | Matches the expression inside the parentheses and groups it.
- `.` | Matches any character except a newline.
- `*` | Greedily matches the expression to its left 0 or more times.
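The difference between `*` (zero or more) and `+` (one or more) is easiest to see on a string without digits. A small sketch on illustrative strings:

```python
import re

s = '$ 340USD'

# `(.*)` captures everything that remains after the digits
groups = re.search(r'(\d+)(.*)', s).groups()
print(groups)  # ('340', 'USD')

# `*` allows zero repetitions, so it succeeds with an empty match
zero_or_more = re.search(r'\d*', 'USD').group()
print(repr(zero_or_more))  # ''

# `+` needs at least one digit, so there is no match at all
one_or_more = re.search(r'\d+', 'USD')
print(one_or_more)  # None
```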

In [171]:
# All captured groups, as a tuple
print(r.groups())
# Access groups by index, in any order
print(r.group(2, 1))

('340', 'USD')
('USD', '340')


In [172]:
# Group the dollar sign, the digits and the currency
s = '$ 340USD'
r = re.search(r'(^.) (\d+)(\w+)', s)
print(r.groups())

('$', '340', 'USD')


- `^`: Matches the expression to its right at the start of the string. With the `re.MULTILINE` flag it also matches at the start of each line, that is, immediately after each `\n`.
- `\s`: Matches whitespace characters, which include the \t, \n, \r, and space characters.
- `\S` | Matches non-whitespace characters.
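A short sketch of how `^` behaves with and without the `re.MULTILINE` flag, and of `\s+` as a separator. The two-line string is made up for the illustration:

```python
import re

text = '$ 340USD\n£ 120GBP'

# Default: `^` anchors only at the very start of the string
start_only = re.findall(r'^\S', text)
print(start_only)  # ['$']

# re.MULTILINE: `^` also anchors right after each newline
each_line = re.findall(r'^\S', text, flags=re.MULTILINE)
print(each_line)  # ['$', '£']

# `\s+` matches any run of spaces, tabs and newlines
tokens = re.split(r'\s+', 'a \t b\nc')
print(tokens)  # ['a', 'b', 'c']
```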

In [173]:
# Instead of `.`, use `\S` to match `$`
s = '$ 340USD'
r = re.search(r'^(\S)\s+(\d+)(.*)', s)
print(r.groups())

('$', '340', 'USD')


### Lookahead and lookbehind

- `A(?=B)` | Positive lookahead assertion. This matches the expression A only if it is immediately followed by B. B is not part of the match.
- `(?<=B)A` | Positive lookbehind assertion. This matches the expression A only if B is immediately to its left. B must be a fixed-length pattern.

In [174]:
# Look for amounts in Mexican pesos (MXN)
s1 = '529MXN'
s2 = '10USD'
print(re.search(r'\d+(?=MXN)', s1))
print(re.search(r'\d+(?=MXN)', s2))

<re.Match object; span=(0, 3), match='529'>
None


In [175]:
# Look for amounts in cents
s1 = '$529'
s2 = '¢10'
print(re.search(r'(?<=¢)\d+', s1))
print(re.search(r'(?<=¢)\d+', s2))

None
<re.Match object; span=(1, 3), match='10'>
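Lookbehind and lookahead can be combined in one pattern. The sketch below, on made-up values in the same format as the dataset's market values, extracts a decimal amount only when it sits between `£` and `m`; neither delimiter becomes part of the match:

```python
import re

# Amount must be preceded by '£' and followed by 'm'
pattern = r'(?<=£)\d+\.\d+(?=m)'

m = re.search(pattern, '£36.00m')
print(m.group(), m.span())  # 36.00 (1, 6)

# Wrong unit: the lookahead fails
wrong_unit = re.search(pattern, '£450Th.')
# Wrong currency: the lookbehind fails
wrong_currency = re.search(pattern, '$36.00m')
print(wrong_unit, wrong_currency)  # None None
```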


Negative lookahead and lookbehind

- `A(?!B)` | Negative lookahead assertion. This matches the expression A only if it is not immediately followed by B.
- `(?<!B)A` | Negative lookbehind assertion. This matches the expression A only if B is not immediately to its left. B must be a fixed-length pattern.

In [176]:
# Match digits that are not preceded by a word character
s1 = 'p529'
s2 = '¢10'
print(re.search(r'(?<!\w)\d+', s1))
print(re.search(r'(?<!\w)\d+', s2))

None
<re.Match object; span=(1, 3), match='10'>
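Negative lookahead interacts with backtracking in a way that is easy to miss: a greedy `\d+` can give back a digit to make the assertion succeed. A sketch on made-up strings, with one possible way to prevent it:

```python
import re

# Naive attempt: "digits not followed by USD"
# `\d+` backtracks from '10' to '1', and '1' is followed by '0', not 'USD'
naive = re.search(r'\d+(?!USD)', '10USD').group()
print(naive)  # 1

# Also forbid a following digit, so the digit run cannot be shortened
strict_usd = re.search(r'\d+(?!\d|USD)', '10USD')
strict_mxn = re.search(r'\d+(?!\d|USD)', '529MXN').group()
print(strict_usd, strict_mxn)  # None 529
```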


### Apply in columns

- Get the market value of each player as a float.
- Get the currency of the value.

In [177]:
fut.columns = fut.columns.str.lower()
fut.head()

Unnamed: 0,club,name,age,nationality,position,pos,marketvalue,clubinvolved,countryinvolved,fee,movement,season,window,league,profile
0,Chelsea FC,Álvaro Morata,24,Spain,Centre-Forward,CF,£36.00m,Real Madrid,Spain,£59.40m,In,2017,s,premier-league,https://www.transfermarkt.co.uk/alvaro-morata/...
1,Chelsea FC,Tiemoué Bakayoko,22,France,Defensive Midfield,DM,£14.40m,AS Monaco,France,£36.00m,In,2017,s,premier-league,https://www.transfermarkt.co.uk/tiemoue-bakayo...
2,Chelsea FC,Danny Drinkwater,27,England,Central Midfield,CM,£8.10m,Leicester City,England,£34.11m,In,2017,s,premier-league,https://www.transfermarkt.co.uk/danny-drinkwat...
3,Chelsea FC,Antonio Rüdiger,24,Germany,Centre-Back,CB,£22.50m,AS Roma,Italy,£31.50m,In,2017,s,premier-league,https://www.transfermarkt.co.uk/antonio-rudige...
4,Chelsea FC,Davide Zappacosta,25,Italy,Right-Back,RB,£7.65m,Torino FC,Italy,£22.50m,In,2017,s,premier-league,https://www.transfermarkt.co.uk/davide-zappaco...


In [178]:
# Remove whitespace
fut['marketvalue'] = fut['marketvalue'].str.replace(r'\s', '', regex=True)
fut['marketvalue'].tail()

791    £450Th.
792     £90Th.
793    £450Th.
794    £315Th.
795     £1.35m
Name: marketvalue, dtype: object

In [179]:
# Group currency, amount and unit
s1 = '£450Th.'
s2 = '$1.35m'
print(re.search(r'(\W)(\d+)(\w+)', s1).groups())
print(re.search(r'(\W)(\d+)(\w+)', s2).groups())

('£', '450', 'Th')
('.', '35', 'm')


In [180]:
# What about the decimal point?
s2 = '$1.35m'
s1 = '£450Th.'
print(re.search(r'(^\W)(\d+\.\d+)(\w+)', s2).groups())
print(re.search(r'(^\W)(\d+\.\d+)(\w+)', s1).groups())

('$', '1.35', 'm')


AttributeError: 'NoneType' object has no attribute 'groups'

In [181]:
s1

'£450Th.'

In [182]:
# Does it match amounts without a decimal point?
print(re.search(r'(\W)(\d+\.\d+)(\w+)', s1).groups())

AttributeError: 'NoneType' object has no attribute 'groups'

`?` | Greedily matches the expression to its left 0 or 1 times. When `?` is placed after another quantifier (`+`, `*`, or `?` itself), that quantifier matches in a non-greedy manner instead.
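Both roles of `?` can be seen side by side in a short sketch on illustrative strings:

```python
import re

# `?` as "optional": the `u` may or may not be present
us = re.search(r'colou?r', 'color').group()
uk = re.search(r'colou?r', 'colour').group()
print(us, uk)  # color colour

tag = '<b>Morata</b>'

# Greedy `.+` runs to the last '>' in the string
greedy = re.search(r'<.+>', tag).group()
print(greedy)  # <b>Morata</b>

# Non-greedy `.+?` stops at the first '>' it can
lazy = re.search(r'<.+?>', tag).group()
print(lazy)  # <b>
```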

In [183]:
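# Make the decimal point optional with `?`.
# `\d*` after it lets the fractional digits be absent, so single-digit amounts match too.
s1 = '£450Th.'
print(re.search(r'(\W)(\d+\.?\d*)(\w+)', s1).groups())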

('£', '450', 'Th')


### Check for real cases

In [None]:
fut['marketvalue'].unique()

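In [184]:
def get_market(x, what='currency'):
    """Return the currency, amount or unit parsed from a market-value string."""
    # Groups: (currency symbol)(amount, decimal part optional)(unit)
    r = re.search(r'(\W)(\d+\.?\d*)(\w+)', x)
    if not r:
        return np.nan
    if what == 'currency':
        return r.group(1)
    elif what == 'amount':
        return float(r.group(2))
    elif what == 'unit':
        return r.group(3)
    raise ValueError(f"unknown value for `what`: {what!r}")

In [185]:
fut['marketvalue'].apply(get_market)

0      £
1      £
2      £
3      £
4      £
      ..
791    £
792    £
793    £
794    £
795    £
Name: marketvalue, Length: 796, dtype: object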

In [186]:
# Get currency, amount and unit
fut['currency'] = fut['marketvalue'].apply(get_market, what='currency')
fut['amount'] = fut['marketvalue'].apply(get_market, what='amount')
fut['unit'] = fut['marketvalue'].apply(get_market, what='unit')
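As an alternative to calling `apply` three times, pandas can extract all groups in one pass with `Series.str.extract`. The following is only a sketch on a made-up Series, not the notebook's data. It assumes named groups (`(?P<name>...)`, which become the column names) and an amount pattern, `\d+\.?\d*`, that also accepts amounts without a fractional part:

```python
import pandas as pd

# Toy values in the same format as the `marketvalue` column ('-' has no amount)
values = pd.Series(['£36.00m', '£450Th.', '-'])

# One named group per output column; rows that do not match become NaN
pattern = r'(?P<currency>\W)(?P<amount>\d+\.?\d*)(?P<unit>\w+)'
parts = values.str.extract(pattern)

# The extracted amounts are strings, so convert them to numbers
parts['amount'] = pd.to_numeric(parts['amount'])
print(parts)
```

The result is a DataFrame with the columns `currency`, `amount` and `unit`, which could be assigned back to the original frame.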

### Multiply amount by unit

In [187]:
# Which units do we have?
fut['unit'].unique()

array(['m', nan, 'Th'], dtype=object)

```
Signature:
pd.Series.mask(
    self,
    cond,
    other=nan,
    inplace=False,
    axis=None,
    level=None,
    errors='raise',
    try_cast=False,
)
Docstring:
Replace values where the condition is True.
```

In [188]:
# Multiply using `mask`
fut['amount'] = fut['amount'].mask(fut['unit'] == 'm', fut['amount'] * 1000000)

fut['amount'] = fut['amount'].mask(fut['unit'] == 'Th', fut['amount'] * 1000)
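The effect of `mask` is easier to follow on a toy example. This sketch uses made-up amounts and units, and also shows one possible alternative that maps each unit to a multiplier; the `multiplier` dictionary is an assumption for the illustration:

```python
import pandas as pd

amount = pd.Series([36.0, 450.0, 1.5])
unit = pd.Series(['m', 'Th', 'm'])

# mask(cond, other): where cond is True, take the value from `other`
scaled = (
    amount
    .mask(unit == 'm', amount * 1_000_000)
    .mask(unit == 'Th', amount * 1_000)
)
print(scaled.tolist())  # [36000000.0, 450000.0, 1500000.0]

# Alternative: look up a multiplier per unit (hypothetical mapping)
multiplier = {'m': 1_000_000, 'Th': 1_000}
by_map = amount * unit.map(multiplier)
print(by_map.tolist())  # [36000000.0, 450000.0, 1500000.0]
```

With `map`, a unit missing from the dictionary yields `NaN`, whereas `mask` leaves the amount unchanged.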

### Change amount to millions

- Change all units to `'m'`

In [189]:
fut['amount'] = fut['amount'] / 1e6

In [190]:
fut.loc[fut['unit'].notna(), 'unit'] = 'm'

## Fine-grained information about players' positions

The position can be either:

 - Forward
 - Midfield
 - Back
 - Winger
 - Goalkeeper
 - Striker

The direction can be:

- Centre
- Left
- Right

In [None]:
# Check for unique values
fut['position'].unique()

In [None]:
fut['position'] = fut['position'].str.lower()

In [None]:
fut['position'] = fut['position'].str.replace(r'\s', '-', regex=True)

In [None]:
fut['position'].unique()

In [None]:
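def get_position(x, what='position'):
    """Split a label such as 'centre-forward' into direction and position."""
    r = re.search(r'(\w+)-(\w+)', x)
    if not r:
        # No hyphen (e.g. 'goalkeeper'): return the label unchanged
        return x
    if what == 'position':
        return r.group(2)  # part after the hyphen, e.g. 'forward'
    else:
        return r.group(1)  # part before the hyphen, e.g. 'centre'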

In [None]:
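# 'hor' holds the position (part after the hyphen);
# 'ver' holds the direction (part before the hyphen)
fut['hor'] = fut['position'].apply(get_position)
fut['ver'] = fut['position'].apply(get_position, what='vertical')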
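For a split on a single known separator, a regex is not strictly needed. As a sketch of a regex-free alternative, `str.rpartition('-')` splits at the last hyphen; `split_position` is a hypothetical helper and the labels are illustrative:

```python
def split_position(label):
    """Return (direction, position); direction is None when there is no hyphen."""
    # rpartition returns (before, separator, after); without a hyphen
    # the whole label ends up in the last element
    direction, _, position = label.rpartition('-')
    return (direction or None, position)

forward = split_position('centre-forward')
keeper = split_position('goalkeeper')
print(forward)  # ('centre', 'forward')
print(keeper)   # (None, 'goalkeeper')
```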

In [None]:
fut['hor'].value_counts()

### Get the first and last name of each player

- Count the words in each name.
- In names with more than two words, join name particles (`van`, `de`) to the following word with a hyphen (`'-'`).
- Get the first and last name.

In [None]:
fut['len'] = fut['name'].str.split().str.len()

In [None]:
fut[fut['len'] > 2].head()

`\b` | Matches the boundary (or empty string) at the start and end of a word, that is, between \w and \W
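The word boundary matters because `van` can also appear inside a longer word. The sketch below uses `re.sub()` on two illustrative names (not rows looked up in the dataset) to compare the pattern with and without `\b`:

```python
import re

with_boundary = r'(\bvan\b)\s(\w+)'
without_boundary = r'(van)\s(\w+)'

# 'van' as a separate word: joined to the next word with a hyphen
joined = re.sub(with_boundary, r'\1-\2', 'Virgil van Dijk')
print(joined)  # Virgil van-Dijk

# 'van' inside 'Ivan': `\b` prevents a match, so the name is unchanged
kept = re.sub(with_boundary, r'\1-\2', 'Ivan Perisic')
print(kept)  # Ivan Perisic

# Without `\b` the end of 'Ivan' matches and the name is damaged
damaged = re.sub(without_boundary, r'\1-\2', 'Ivan Perisic')
print(damaged)  # Ivan-Perisic
```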

In [None]:
fut['name'].str.replace(r'(\bvan\b)\s(\w+)', r'\1-\2', regex=True).loc[618]

In [None]:
fut['name'].str.replace(r'(\bde\b)\s(\w+)', r'\1-\2', regex=True).loc[560]

Remove `'Jr.'`

In [None]:
fut['name'].str.replace(r'\sJr\.?', r'', regex=True).loc[9]

In [None]:
fut['name'] = fut['name'].str.replace(r'(\bvan\b)\s(\w+)', r'\1-\2', regex=True)
fut['name'] = fut['name'].str.replace(r'(\bde\b)\s(\w+)', r'\1-\2', regex=True)
fut['name'] = fut['name'].str.replace(r'\sJr\.?', r'', regex=True)

Use `rsplit()` to get the last name

In [None]:
fut['name'].str.rsplit(n=1).head()

In [None]:
fut['last'] = fut['name'].str.rsplit(n=1).str[1]
fut['first'] = fut['name'].str.rsplit(n=1).str[0]
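On plain strings, `rsplit(maxsplit=1)` cuts once, counting from the right, so everything before the last word stays together. Names with a single word produce a one-element list, which is why index `1` can be missing. A sketch with illustrative names; `first_last` is a hypothetical helper:

```python
def first_last(name):
    """Return (first, last); last is None for single-word names."""
    parts = name.rsplit(maxsplit=1)   # at most one cut, from the right
    if len(parts) == 2:
        return parts[0], parts[1]
    return parts[0], None

two_words = first_last('Danny Drinkwater')
hyphenated = first_last('Virgil van-Dijk')
single = first_last('Willian')
print(two_words)   # ('Danny', 'Drinkwater')
print(hyphenated)  # ('Virgil', 'van-Dijk')
print(single)      # ('Willian', None)
```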

In [None]:
fut['first'].value_counts().head()

In [None]:
fut['last'].value_counts().head()

# References

- Regular Expressions: https://realpython.com/regex-python
- Cheat Sheet: https://www.dataquest.io/blog/regex-cheatsheet/
- Sudoku Regex: https://regexcrossword.com/