## Regular expressions
### Introducing the dataset

In [1]:
import csv

# Read every row of the CSV file; the first row holds the column names
with open("askreddit-2015.csv", 'r', encoding='utf-8', newline='') as f:
    posts = list(csv.reader(f))
posts[:10]

[['Title', 'Score', 'Time', 'Gold', 'NumComs'],
 ['What\'s your internet "white whale", something you\'ve been searching for years to find with no luck?',
  '11510',
  '1433213314.0',
  '1',
  '26195'],
 ["What's your favorite video that is 10 seconds or less?",
  '8656',
  '1434205517.0',
  '4',
  '8479'],
 ['What are some interesting tests you can take to find out about yourself?',
  '8480',
  '1443409636.0',
  '1',
  '4055'],
 ["PhD's of Reddit. What is a dumbed down summary of your thesis?",
  '7927',
  '1440188623.0',
  '0',
  '13201'],
 ['What is cool to be good at, yet uncool to be REALLY good at?',
  '7711',
  '1440082910.0',
  '0',
  '20325'],
 ['[Serious] Redditors currently in a relationship, besides dinner and a movie, what are your favorite activities for date night?',
  '7598',
  '1439993280.0',
  '2',
  '5389'],
 ["Parents of Reddit, what's something that your kid has done that you pretended to be angry about but secretly impressed or amused you?",
  '7553',
  '143916180

In [2]:
# Drop the header row so that only the posts remain
posts = posts[1:]
for post in posts[:10]:
    print(post)

['What\'s your internet "white whale", something you\'ve been searching for years to find with no luck?', '11510', '1433213314.0', '1', '26195']
["What's your favorite video that is 10 seconds or less?", '8656', '1434205517.0', '4', '8479']
['What are some interesting tests you can take to find out about yourself?', '8480', '1443409636.0', '1', '4055']
["PhD's of Reddit. What is a dumbed down summary of your thesis?", '7927', '1440188623.0', '0', '13201']
['What is cool to be good at, yet uncool to be REALLY good at?', '7711', '1440082910.0', '0', '20325']
['[Serious] Redditors currently in a relationship, besides dinner and a movie, what are your favorite activities for date night?', '7598', '1439993280.0', '2', '5389']
["Parents of Reddit, what's something that your kid has done that you pretended to be angry about but secretly impressed or amused you?", '7553', '1439161809.0', '0', '11520']
['What is a good subreddit to binge read the All Time Top Posts of?', '7498', '1438822288.0',
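
As a side note, the header row can also be split off while reading instead of slicing the list afterwards. The sketch below is only an illustration: it uses a small in-memory CSV (`sample_csv`, a made-up stand-in for the real file) and calls `next()` on the reader to consume the header before collecting the remaining rows.

```python
import csv
import io

# Hypothetical two-line CSV standing in for askreddit-2015.csv
sample_csv = io.StringIO(
    "Title,Score,Time,Gold,NumComs\n"
    '"A title, with a comma",10,1433213314.0,0,3\n'
)

reader = csv.reader(sample_csv)
header = next(reader)   # consume the first row (the column names)
rows = list(reader)     # everything that is left is data

print(header)
print(rows)
```

Every field comes back as a string, including the numeric columns, which matches what the outputs above show.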

### Counting matches with the re module

In [7]:
# re.search(regex, string) => returns a "match" object or None
# Pattern: "of Reddit"
import re

of_reddit_count = 0
for post in posts:
    if re.search('of Reddit', post[0]) is not None:
        of_reddit_count += 1

print(f"There are {of_reddit_count} titles containing the string 'of Reddit'")

There are 76 titles containing the string 'of Reddit'
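
To make the return value of `re.search` more concrete, here is a minimal sketch on two made-up titles (`sample_titles` is hypothetical and not taken from the dataset). It shows that a successful search yields a match object carrying the matched text and its position, while a failed search yields `None`.

```python
import re

# Hypothetical sample titles, not taken from the dataset
sample_titles = [
    "Doctors of Reddit, what is your strangest case?",
    "What is your favorite book?",
]

matches = [re.search("of Reddit", title) for title in sample_titles]

# A successful search returns a match object with the text and its span
first = matches[0]
print(first.group(), first.span())

# An unsuccessful search returns None
print(matches[1])

# Counting the non-None results gives the number of matching titles
match_count = sum(m is not None for m in matches)
print(match_count)
```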


### Using brackets to match one of several letters

In [8]:
import re

of_reddit_count = 0
for post in posts:
    # [rR] matches exactly one character: either 'r' or 'R'
    if re.search('of [rR]eddit', post[0]) is not None:
        of_reddit_count += 1

print(f"There are {of_reddit_count} titles containing 'of Reddit' or 'of reddit'")

There are 102 titles containing 'of Reddit' or 'of reddit'
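
A quick way to see what the character class does and does not accept is to run the pattern over a few made-up strings (`samples` below is hypothetical). The class only relaxes the one position it occupies; the rest of the pattern stays case-sensitive.

```python
import re

pattern = re.compile(r"of [rR]eddit")

# Hypothetical strings chosen to probe the character class
samples = [
    "Teachers of Reddit",   # capital R: matches
    "teachers of reddit",   # lowercase r: matches
    "teachers of Xeddit",   # X is not in [rR]: no match
    "TEACHERS OF REDDIT",   # "OF" is uppercase: no match
]

results = {s: bool(pattern.search(s)) for s in samples}
for text, matched in results.items():
    print(f"{text!r}: {matched}")
```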


### Escaping special characters

In [10]:
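# Tag to find: [Serious]
# Brackets are regex metacharacters, so they must be escaped with a backslash.
# A raw string (r'...') keeps Python from interpreting the backslashes itself.

import re

serious_count = 0
for post in posts:
    if re.search(r'\[Serious\]', post[0]) is not None:
        serious_count += 1

print(f"There are {serious_count} titles containing the string '[Serious]'")

There are 69 titles containing the string '[Serious]'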
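
The effect of escaping can be checked on a single made-up title. In this sketch, the unescaped pattern is read as a character class and matches just one letter, whereas the escaped one matches the literal tag. `re.escape` from the standard library is used here as a convenience to produce the escaped form; the notebook itself writes the backslashes by hand.

```python
import re

tag = "[Serious]"
title = "[Serious] What changed your mind?"  # hypothetical title

# Unescaped, the brackets define a character class: any ONE of S, e, r, i, o, u, s
unescaped_hit = re.search(tag, title)
print(unescaped_hit.group())

# re.escape adds the backslashes needed to match the tag literally
escaped = re.escape(tag)
escaped_hit = re.search(escaped, title)
print(escaped, escaped_hit.group())
```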


### Refining our regex

In [11]:
# Count the tags (Serious), (serious), [Serious] and [serious]

import re

serious_count = 0
for post in posts:
    # Opening ( or [, then Serious/serious, then closing ] or )
    if re.search(r'[\(\[][Ss]erious[\]\)]', post[0]) is not None:
        serious_count += 1

print(f"There are {serious_count} titles containing a Serious tag")

There are 80 titles containing a Serious tag
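
One property of this pattern is worth seeing in action: because the opening and closing delimiters are matched by two independent character classes, mismatched pairs such as `(Serious]` are accepted too. The sketch below compares it with a stricter alternation; `strict` is an illustrative alternative, not something the notebook uses.

```python
import re

# Pattern from the cell above: any opener combined with any closer
loose = re.compile(r"[\(\[][Ss]erious[\]\)]")
# Illustrative stricter variant: the delimiters must pair up
strict = re.compile(r"\[[Ss]erious\]|\([Ss]erious\)")

variants = ["[Serious]", "(serious)", "[serious)", "(Serious]"]

loose_hits = [v for v in variants if loose.search(v)]
strict_hits = [v for v in variants if strict.search(v)]

print(loose_hits)
print(strict_hits)
```

Whether the looser behavior matters depends on the data; for counting tags in titles it is probably harmless.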


### Combining several regexes

In [13]:
# Count the tags (Serious), (serious), [Serious] and [serious] at the start of the title -> serious_start_count
# Count the tags (Serious), (serious), [Serious] and [serious] at the end of the title -> serious_end_count
# Count the tags (Serious), (serious), [Serious] and [serious] at the start or at the end of the title -> serious_count_final

import re

serious_start_count = 0
serious_end_count = 0
serious_count_final = 0
for post in posts:
    # ^ anchors the pattern to the start of the string
    if re.search(r'^[\(\[][Ss]erious[\]\)]', post[0]) is not None:
        serious_start_count += 1
    # $ anchors the pattern to the end of the string
    if re.search(r'[\(\[][Ss]erious[\]\)]$', post[0]) is not None:
        serious_end_count += 1
    # | combines both alternatives into a single regex
    if re.search(r'^[\(\[][Ss]erious[\]\)]|[\(\[][Ss]erious[\]\)]$', post[0]) is not None:
        serious_count_final += 1

print(f"There are {serious_start_count} titles with a Serious tag (either case, brackets or parentheses) at the start")
print(f"There are {serious_end_count} titles with a Serious tag (either case, brackets or parentheses) at the end")
print(f"There are {serious_count_final} titles with a Serious tag (either case, brackets or parentheses) at the start or at the end")

There are 69 titles with a Serious tag (either case, brackets or parentheses) at the start
There are 11 titles with a Serious tag (either case, brackets or parentheses) at the end
There are 80 titles with a Serious tag (either case, brackets or parentheses) at the start or at the end
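
The three counters can be reproduced on a handful of made-up titles (`titles` below is hypothetical) to see how the anchors and the `|` operator interact. The last title carries the tag at both ends: it is counted once by each anchored pattern but only once by the combined one, so the combined count is not guaranteed to equal the sum of the other two.

```python
import re

tag = r"[\(\[][Ss]erious[\]\)]"
start_re = re.compile("^" + tag)                    # tag at the start
end_re = re.compile(tag + "$")                      # tag at the end
either_re = re.compile("^" + tag + "|" + tag + "$") # start OR end

# Hypothetical titles, not taken from the dataset
titles = [
    "[Serious] What is your biggest regret?",
    "What is your biggest regret? (serious)",
    "Is [Serious] a useful tag?",                   # tag in the middle only
    "[Serious] Why do people tag twice? [Serious]", # tag at both ends
]

start_count = sum(bool(start_re.search(t)) for t in titles)
end_count = sum(bool(end_re.search(t)) for t in titles)
either_count = sum(bool(either_re.search(t)) for t in titles)

print(start_count, end_count, either_count)
```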


### Modifying strings with regex

In [25]:
# re.sub(regex, replacement string, input string)
# Replace [serious], (serious) and (Serious) with [Serious]

import re

posts_new = []
# Compile the regex once up front (not strictly required) so the loop reuses the same pattern object
regex = re.compile(r'[\(\[][Ss]erious[\]\)]')
for post in posts:
    # Note: the title is updated in place, so posts and posts_new share the same rows
    post[0] = regex.sub('[Serious]', post[0])
    posts_new.append(post)

print(posts_new[:10])


[['What\'s your internet "white whale", something you\'ve been searching for years to find with no luck?', '11510', '1433213314.0', '1', '26195'], ["What's your favorite video that is 10 seconds or less?", '8656', '1434205517.0', '4', '8479'], ['What are some interesting tests you can take to find out about yourself?', '8480', '1443409636.0', '1', '4055'], ["PhD's of Reddit. What is a dumbed down summary of your thesis?", '7927', '1440188623.0', '0', '13201'], ['What is cool to be good at, yet uncool to be REALLY good at?', '7711', '1440082910.0', '0', '20325'], ['[Serious] Redditors currently in a relationship, besides dinner and a movie, what are your favorite activities for date night?', '7598', '1439993280.0', '2', '5389'], ["Parents of Reddit, what's something that your kid has done that you pretended to be angry about but secretly impressed or amused you?", '7553', '1439161809.0', '0', '11520'], ['What is a good subreddit to binge read the All Time Top Posts of?', '7498', '143882
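
The cell above rewrites each title in place. If the original rows should be kept untouched, one option is to build new rows instead. The following is a minimal sketch on made-up data (`rows` is hypothetical), using the compiled pattern's own `sub` method.

```python
import re

tag_re = re.compile(r"[\(\[][Ss]erious[\]\)]")

# Hypothetical rows in the same [title, score, ...] layout as the dataset
rows = [
    ["(serious) What is your plan?", "10"],
    ["No tag here", "5"],
]

# Build new rows; the originals are left unchanged
normalized = [[tag_re.sub("[Serious]", title), *rest] for title, *rest in rows]

print(normalized)
print(rows[0][0])  # still the original title
```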

### Matching years with a regex

### Extracting all the years

In [29]:
# re.findall(regex, input string) => returns a list of every match
re.findall("[a-z]", 'abd123')

['a', 'b', 'd']

In [28]:
# Use re.findall() to build a list (years) of every year between 1000 and 2999 found in year_string

year_string = 'It is already 2019, one year after 2018 and one year before 2020'
# [12] matches the first digit, \d{3} the three digits that follow
years = re.findall(r"[12]\d{3}", year_string)
years

['2019', '2018', '2020']
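
One caveat with this pattern: it matches any four-digit run starting with 1 or 2, even when that run sits inside a longer number. A possible refinement, shown here as a sketch on a made-up sentence (`text` is hypothetical), is to add `\b` word boundaries on both sides.

```python
import re

# Hypothetical text mixing years with longer numbers
text = "Order 120195 shipped in 2019, invoice 32020 followed in 2020."

# Without boundaries, pieces of longer numbers are picked up as well
loose_years = re.findall(r"[12]\d{3}", text)

# With \b on both sides, only standalone four-digit numbers match
strict_years = re.findall(r"\b[12]\d{3}\b", text)

print(loose_years)
print(strict_years)
```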