 <div>
<img src="https://edlitera-images.s3.amazonaws.com/new_edlitera_logo.png" width="500"/>
</div>

# Expanding contractions

* there are several ways we can expand contractions
    <br>
    
    * using RegEx
    * using word vector similarity

* the second solution requires knowledge of Deep Learning
    <br>
    
    * I won't go over it now, since it is impossible to explain without properly introducing libraries such as Gensim and topics such as dense representations and word vectors
    <br>
    
    * in short, we convert a word into a vector and then search a large pretrained set of word vectors for the vector that is most similar to ours
    <br>
    
    * since a contraction and its expanded version are used in similar contexts, the similarity score between their vectors is usually high, which typically allows us to replace the contraction with its expanded version
    <br>
    
    * more on this later

* for now, let's focus on demonstrating how to catch all contractions using RegEx
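
As a rough illustration of the word vector idea described above, the sketch below compares a contraction to other words using cosine similarity. The vectors in `toy_vectors` are invented for this example, and `cosine_similarity` and `most_similar` are hypothetical helper names. Real word vectors are learned from large corpora and have hundreds of dimensions.

```python
import math

# Hypothetical 3-dimensional vectors, made up for illustration only
toy_vectors = {
    "can't": [0.9, 0.1, 0.3],
    "cannot": [0.85, 0.15, 0.35],
    "banana": [0.1, 0.9, 0.2],
}


def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: dot product over product of norms
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(word, vectors):
    # Score every other word against `word` and return the best match
    scores = {
        other: cosine_similarity(vectors[word], vector)
        for other, vector in vectors.items()
        if other != word
    }
    return max(scores, key=scores.get)


print(most_similar("can't", toy_vectors))
```

With these toy numbers, the vector for "cannot" points in almost the same direction as the one for "can't", so it is returned as the closest match.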

## _Using RegEx_

* the idea is to create a dictionary where:
    <br>
    
    * keys = contractions
    * values = expanded versions

* relatively simple approach since the English language has a limited number of contractions in it
    

* the process is simple:
    <br>
    
    * create the dictionary of contractions in lowercase
    * create a function that can find contractions in some text using RegEx
    * create a function that replaces contractions with their expanded versions based on the created dictionary
    * use the function on **lowercased** text data
    * use **Advanced Sentence Segmentation** from NLTK (`sent_tokenize`) to separate the lowercased string with no contractions into sentences, then capitalize the beginning of each sentence with `.capitalize()` - the tokenizer only splits the text, it does not change the casing

### Code example:

In [1]:
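# Import what we need
# Note: sent_tokenize relies on NLTK's Punkt tokenizer data, which must be
# downloaded once, e.g. nltk.download("punkt") ("punkt_tab" in newer NLTK releases)

import re
from nltk import sent_tokenize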

In [2]:
# Create a dictionary of the most common contractions
# that appear in the English language

contractions = {
    "ain't": 'are not',
    "'s": ' is',
    "aren't": 'are not',
    "can't": 'cannot',
    "can't've": 'cannot have',
    "'cause": 'because',
    "could've": 'could have',
    "couldn't": 'could not',
    "couldn't've": 'could not have',
    "didn't": 'did not',
    "doesn't": 'does not',
    "don't": 'do not',
    "hadn't": 'had not',
    "hadn't've": 'had not have',
    "hasn't": 'has not',
    "haven't": 'have not',
    "he'd": 'he would',
    "he'd've": 'he would have',
    "he'll": 'he will',
    "he'll've": 'he will have',
    "how'd": 'how did',
    "how'd'y": 'how do you',
    "how'll": 'how will',
    "i'd": 'i would',
    "i'd've": 'i would have',
    "i'll": 'i will',
    "i'll've": 'i will have',
    "i'm": 'i am',
    "i've": 'i have',
    "isn't": 'is not',
    "it'd": 'it would',
    "it'd've": 'it would have',
    "it'll": 'it will',
    "it'll've": 'it will have',
    "let's": 'let us',
    "ma'am": 'madam',
    "mayn't": 'may not',
    "might've": 'might have',
    "mightn't": 'might not',
    "mightn't've": 'might not have',
    "must've": 'must have',
    "mustn't": 'must not',
    "mustn't've": 'must not have',
    "needn't": 'need not',
    "needn't've": 'need not have',
    "o'clock": 'of the clock',
    "oughtn't": 'ought not',
    "oughtn't've": 'ought not have',
    "shan't": 'shall not',
    "sha'n't": 'shall not',
    "shan't've": 'shall not have',
    "she'd": 'she would',
    "she'd've": 'she would have',
    "she'll": 'she will',
    "she'll've": 'she will have',
    "should've": 'should have',
    "shouldn't": 'should not',
    "shouldn't've": 'should not have',
    "so've": 'so have',
    "that'd": 'that would',
    "that'd've": 'that would have',
    "there'd": 'there would',
    "there'd've": 'there would have',
    "they'd": 'they would',
    "they'd've": 'they would have',
    "they'll": 'they will',
    "they'll've": 'they will have',
    "they're": 'they are',
    "they've": 'they have',
    "to've": 'to have',
    "wasn't": 'was not',
    "we'd": 'we would',
    "we'd've": 'we would have',
    "we'll": 'we will',
    "we'll've": 'we will have',
    "we're": 'we are',
    "we've": 'we have',
    "weren't": 'were not',
    "what'll": 'what will',
    "what'll've": 'what will have',
    "what're": 'what are',
    "what've": 'what have',
    "when've": 'when have',
    "where'd": 'where did',
    "where've": 'where have',
    "who'll": 'who will',
    "who'll've": 'who will have',
    "who've": 'who have',
    "why've": 'why have',
    "will've": 'will have',
    "won't": 'will not',
    "won't've": 'will not have',
    "would've": 'would have',
    "wouldn't": 'would not',
    "wouldn't've": 'would not have',
    "y'all": 'you all',
    "y'all'd": 'you all would',
    "y'all'd've": 'you all would have',
    "y'all're": 'you all are',
    "y'all've": 'you all have',
    "you'd": 'you would',
    "you'd've": 'you would have',
    "you'll": 'you will',
    "you'll've": 'you will have',
    "you're": 'you are',
    "you've": 'you have'
}

In [3]:
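# Create function for replacing contractions

def replace_contractions(text, contractions_map):
    # Try longer contractions first, so that "he'd've" is not
    # matched as "he'd" with a leftover "'ve"
    keys = sorted(contractions_map, key=len, reverse=True)
    # Escape the keys so they are always treated as literal text
    contractions_re = re.compile("|".join(re.escape(key) for key in keys))

    def expand(matched):
        return contractions_map[matched.group(0)]

    return contractions_re.sub(expand, text)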
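
Python's `re` module tries the alternatives of a pattern from left to right and takes the first one that matches, so the order of the dictionary keys matters for stacked contractions such as "he'd've". The sketch below shows the effect with a two-entry dictionary. The names `small_map` and `expand_with` are hypothetical and used only for this illustration.

```python
import re

# A tiny, illustrative subset of the contractions dictionary
small_map = {
    "he'd": "he would",
    "he'd've": "he would have",
}


def expand_with(keys, text):
    # Build the alternation in exactly the order the keys are given
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    return pattern.sub(lambda matched: small_map[matched.group(0)], text)


text = "he'd've called"

# Dictionary order: "he'd" is tried first and wins, leaving "'ve" behind
in_dict_order = expand_with(small_map.keys(), text)

# Longest key first: the full contraction "he'd've" is matched
longest_first = expand_with(sorted(small_map, key=len, reverse=True), text)

print(in_dict_order)   # he would've called
print(longest_first)   # he would have called
```

Sorting the keys by length, longest first, is one simple way to make sure the most specific contraction is matched.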

In [4]:
# Example text data

text = "I'm only going to ask one more time, who'll look into it? Don't make me choose."

In [5]:
# Lowercase text data

text_lowercased = text.lower()
text_lowercased

"i'm only going to ask one more time, who'll look into it? don't make me choose."

In [6]:
# Example of replacing contractions

clean_text_lowercased = replace_contractions(
    text_lowercased, 
    contractions
)

print(clean_text_lowercased)

i am only going to ask one more time, who will look into it? do not make me choose.


In [7]:
# Advanced Sentence Segmentation

# sent_tokenize() only splits the text into sentences;
# str.capitalize() then uppercases the first letter of each one

sentences = [sent.capitalize() 
             for sent in sent_tokenize(clean_text_lowercased)]

sentences

['I am only going to ask one more time, who will look into it?',
 'Do not make me choose.']

In [8]:
# If you want to join them back into one string
# you can use the .join() method

sentences_as_string = " ".join(sentences)
print(sentences_as_string)

I am only going to ask one more time, who will look into it? Do not make me choose.
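
One caveat of lowercasing first and capitalizing afterwards: `.capitalize()` only uppercases the first character of a sentence, so a pronoun "i" in the middle of a sentence stays lowercase (proper nouns are lost as well). A minimal sketch of one possible fix for the pronoun, where `restore_pronoun_i` is a hypothetical helper name and not part of the code above:

```python
import re


def restore_pronoun_i(sentence):
    # \b marks a word boundary, so only the standalone word "i" is replaced,
    # not the letter "i" inside other words
    return re.sub(r"\bi\b", "I", sentence)


sentence = "Do not ask what i would do, i am busy."
print(restore_pronoun_i(sentence))
```

This simple rule can misfire on text such as "i.e.", so it is best treated as a starting point rather than a complete solution.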


**NOTE:**
    
* the function `replace_contractions()` can be applied to a column of a Pandas dataframe very easily
    <br>
    <br>
    
    
```
df["my_column"] = df["my_column"].apply(replace_contractions, args=(contractions,))
```
* you can turn the code above into a module and reuse it across projects, so you don't have to replicate all this code

## Honorable mention: _smaller specialized libraries_

* there are also smaller, less widely known libraries that receive less support than the more popular ones

* for example, the **contractions** library
    <br>
    
    * a significant drawback of it and similar libraries: **they often lag behind the newest versions of Python**
    * e.g. at the time of writing, the **contractions** package listed support only up to Python 3.6, so check its current status before relying on it

* if that is not a deal breaker for you, you can find out more about it at the following link: https://snyk.io/advisor/python/contractions
