### Text Preprocessing
Text preprocessing includes a series of steps used to clean and prepare raw text data for analysis or modeling. It typically includes tasks such as removing noise (like HTML tags, URLs, punctuation, emojis, and emoticons), converting text to lowercase, correcting spelling, replacing abbreviations with full forms, removing stopwords, and normalizing words through stemming or lemmatization. 
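
The steps listed above can be chained into a single function. The following is a minimal sketch that uses only the standard library; the name `preprocess` and the tiny `STOPWORDS` set are assumptions made for this illustration and are not part of the notebook's code.

```python
import re
import string

# Hypothetical, deliberately tiny stopword set used only for this illustration.
STOPWORDS = {"the", "is", "a", "an", "and", "to", "in", "it"}

def preprocess(text):
    """Apply a few cleaning steps in a fixed order and return the cleaned text."""
    text = re.sub(r"<[^>]+>", "", text)        # drop HTML tags
    text = re.sub(r"https?://\S+", "", text)   # drop URLs
    text = text.lower()                        # normalize case
    # drop ASCII punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    # drop stopwords and collapse repeated whitespace
    words = [w for w in text.split() if w not in STOPWORDS]
    return " ".join(words)

print(preprocess("The <b>editing suite</b> is GREAT, see https://example.com now!"))
# editing suite great see now
```

The order of the steps matters: URLs and emoticons are made of punctuation characters, so they have to be handled before punctuation is stripped.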


In [118]:
# ! python3 -m pip install ipykernel


In [2]:
import pandas as pd

data = {
        "Employee Name": [
        "Saisab", "Greta", "Stanley", "Chloé", "Martin", "Ava", "Wes", "Kathryn", "Jordan", "Sofia",
        "Spike", "Patty", "Christopher", "Barry", "Mira", "James", "Lana", "Taika", "Denis", "Zoe"
    ],
    "Review": [
        "Fantastic place to work if you're passionate about films! #FilmLife 🎬",
        "The <b>editing suite</b> in Studio B keeps crashing, please fix it ASAP! ;_;",
        "Post-production deadlines are brutal and exhausting. 😡",
        "Loved working on the new sci-fi project, super creative team! #OnSet 💡",
        "<div>The cast and crew coordination has been flawless.</div>",
        "HR is attentive and really understands creatives. 🤗",
        "Cafeteria food on long shoot days needs an upgrade. 🍕☕",
        "Finishing VFX on time is tough sometimes. https://www.vfxtracker.com",
        "The studio has a great culture for innovation and expression! 😄",
        "The script review and approval process is streamlined. 📝🎥",
        "The workspace is clean, with excellent lighting for editors. 👌",
        "We need more training sessions on new filming tech. :[",
        "Management is open to feedback from even junior staff.",
        "Great perks like free movie passes and wellness programs! 🎉🍿",
        "The director of our last film was really supportive.",
        "I love the diversity of projects we get to work on.",
        "The equipment and tech are industry-leading!",
        "Found a broken link in our internal script database: <a href='https://example.com'>fix it</a>",
        "I've seen improvements in our production planning process.",
        "Workload is balanced and scheduling is fair."
    ]
}

df = pd.DataFrame(data)
df

Unnamed: 0,Employee Name,Review
0,Saisab,Fantastic place to work if you're passionate a...
1,Greta,The <b>editing suite</b> in Studio B keeps cra...
2,Stanley,Post-production deadlines are brutal and exhau...
3,Chloé,"Loved working on the new sci-fi project, super..."
4,Martin,<div>The cast and crew coordination has been f...
5,Ava,HR is attentive and really understands creativ...
6,Wes,Cafeteria food on long shoot days needs an upg...
7,Kathryn,Finishing VFX on time is tough sometimes. http...
8,Jordan,The studio has a great culture for innovation ...
9,Sofia,The script review and approval process is stre...


In [4]:
# Work on an independent copy so that later column assignments do not target a slice of df
review_df = df[['Review']].copy()
review_df

Unnamed: 0,Review
0,Fantastic place to work if you're passionate a...
1,The <b>editing suite</b> in Studio B keeps cra...
2,Post-production deadlines are brutal and exhau...
3,"Loved working on the new sci-fi project, super..."
4,<div>The cast and crew coordination has been f...
5,HR is attentive and really understands creativ...
6,Cafeteria food on long shoot days needs an upg...
7,Finishing VFX on time is tough sometimes. http...
8,The studio has a great culture for innovation ...
9,The script review and approval process is stre...


#### REPLACE SHORT FORM WORDS WITH FULL FORM

In [6]:
full_form_dict = {
    'HR': 'Human Resources',
    'VFX': 'Visual Effects',
    'EHR': 'Electronic Health Record',  # Can be removed if not relevant; originally from hospital context
    'ASAP': 'as soon as possible',
    'DoP': 'Director of Photography',
    'AD': 'Assistant Director',
    'CGI': 'Computer Generated Imagery',
    'DP': 'Director of Photography',  # Alternative abbreviation
    'SFX': 'Special Effects',
    'PA': 'Production Assistant',
    'VO': 'Voice Over'
}


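def correct_short_forms(text):
    """Replace whitespace-delimited abbreviations with their full forms."""
    # Only exact tokens match: 'ASAP!' stays unchanged because of the trailing '!'
    words = text.split()
    corrected_words = [full_form_dict.get(word, word) for word in words]
    return ' '.join(corrected_words)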


review_df['Review'] = review_df['Review'].apply(correct_short_forms)
review_df


Unnamed: 0,Review
0,Fantastic place to work if you're passionate a...
1,The <b>editing suite</b> in Studio B keeps cra...
2,Post-production deadlines are brutal and exhau...
3,"Loved working on the new sci-fi project, super..."
4,<div>The cast and crew coordination has been f...
5,Human Resources is attentive and really unders...
6,Cafeteria food on long shoot days needs an upg...
7,Finishing Visual Effects on time is tough some...
8,The studio has a great culture for innovation ...
9,The script review and approval process is stre...
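
Because the lookup above works on whitespace-delimited tokens, an abbreviation followed by punctuation (such as `ASAP!`) is left untouched. One possible alternative, sketched here under the assumption that matching on word boundaries is acceptable, is a single compiled regular expression. `ABBREVIATIONS` and `expand_abbreviations` are hypothetical names used for illustration only.

```python
import re

# Hypothetical subset of the abbreviation dictionary, for illustration only.
ABBREVIATIONS = {
    "HR": "Human Resources",
    "VFX": "Visual Effects",
    "ASAP": "as soon as possible",
}

# One alternation of all keys, longest first, anchored on word boundaries.
_ABBR_PATTERN = re.compile(
    r"\b("
    + "|".join(sorted(map(re.escape, ABBREVIATIONS), key=len, reverse=True))
    + r")\b"
)

def expand_abbreviations(text):
    """Replace whole-word abbreviations, even when punctuation follows them."""
    return _ABBR_PATTERN.sub(lambda m: ABBREVIATIONS[m.group(1)], text)

print(expand_abbreviations("Please fix it ASAP! HR knows."))
# Please fix it as soon as possible! Human Resources knows.
```

The word boundaries keep the pattern from firing inside longer words, so a name like `CHRIS` is not affected by the `HR` entry.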


#### LOWERCASING

In [7]:
review_df['Review'] = review_df['Review'].str.lower()
review_df.head(5)


Unnamed: 0,Review
0,fantastic place to work if you're passionate a...
1,the <b>editing suite</b> in studio b keeps cra...
2,post-production deadlines are brutal and exhau...
3,"loved working on the new sci-fi project, super..."
4,<div>the cast and crew coordination has been f...


#### REMOVE HTML TAGS

In [8]:
import re

def remove_html_tags(text):
    pattern = re.compile(r'<.*?>') 
    return pattern.sub('', text)


In [9]:
review_df['Review'] = review_df['Review'].apply(remove_html_tags)


In [10]:
review_df['Review'][1]

'the editing suite in studio b keeps crashing, please fix it asap! ;_;'
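
A regular expression is sufficient for the simple tags in this dataset. For markup that also contains character references such as `&amp;`, a parser-based approach may be more robust. The sketch below uses the standard library's `html.parser`; `_TextExtractor` and `strip_html` are hypothetical helper names, not part of the notebook.

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_data(self, data):
        # Called for text between tags; character references are already decoded.
        self._chunks.append(data)

    def get_text(self):
        return "".join(self._chunks)

def strip_html(text):
    """Return `text` with tags removed and character references decoded."""
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()
    return parser.get_text()

print(strip_html("The <b>editing suite</b> in Studio B &amp; C"))
# The editing suite in Studio B & C
```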

#### REMOVE URL

In [11]:
def remove_url(text):
    # Match http:// or https:// followed by URL characters or %-escaped bytes
    pattern = re.compile(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
    return pattern.sub('', text)

review_df['Review'] = review_df['Review'].apply(remove_url)


In [12]:
review_df['Review'][7]

'finishing visual effects on time is tough sometimes. '

#### REMOVE PUNCTUATION


In [13]:
import string

def remove_punctuation(text):
    pattern = re.compile(f"[{re.escape(string.punctuation)}]")
    return pattern.sub(r'',text)

review_df['Review'] = review_df['Review'].apply(remove_punctuation)
review_df.head()



Unnamed: 0,Review
0,fantastic place to work if youre passionate ab...
1,the editing suite in studio b keeps crashing p...
2,postproduction deadlines are brutal and exhaus...
3,loved working on the new scifi project super c...
4,the cast and crew coordination has been flawless


#### SPELLING CORRECTION


In [14]:
# ! python3 -m pip install textblob
from textblob import TextBlob

def correct_spelling(text):
    """Return `text` with TextBlob's spelling correction applied."""
    return TextBlob(text).correct().string

review_df['Review'] = review_df['Review'].apply(correct_spelling)


ModuleNotFoundError: No module named 'textblob'

In [15]:
review_df['Review'][14]

'the director of our last film was really supportive'
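
When TextBlob is not installed, the idea behind spelling correction can still be illustrated with the standard library: compare each word against a reference vocabulary and pick the closest match. This is only a rough sketch and not a substitute for a real spell checker. `VOCABULARY`, `correct_word`, and `correct_spelling_simple` are hypothetical names, and the vocabulary is far too small for practical use.

```python
import difflib

# Hypothetical reference vocabulary; a real one would come from a corpus or word list.
VOCABULARY = ["director", "deadline", "supportive", "production", "schedule"]

def correct_word(word, cutoff=0.8):
    """Return the closest vocabulary word, or the word itself if nothing is close enough."""
    matches = difflib.get_close_matches(word, VOCABULARY, n=1, cutoff=cutoff)
    return matches[0] if matches else word

def correct_spelling_simple(text):
    """Correct each whitespace-delimited word independently."""
    return " ".join(correct_word(w) for w in text.split())

print(correct_spelling_simple("the directr was suportive"))
# the director was supportive
```

The `cutoff` value controls how similar a candidate must be. A high value such as 0.8 leaves short, unrelated words like "the" and "was" unchanged.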

In [20]:
import nltk
from nltk.corpus import stopwords

nltk.download('punkt')

nltk.download('stopwords')

from nltk.tokenize import word_tokenize,sent_tokenize



[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\saisab31\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\saisab31\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


#### REMOVING STOPWORDS

In [21]:
stop_words = set(stopwords.words('english'))

In [133]:
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /home/fm-pc-
[nltk_data]     lt-275/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [31]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

def removing_stopwords(text):
    """Tokenize `text` and drop English stopwords."""
    # Reuses the module-level stop_words set built above instead of rebuilding it per call.
    # word_tokenize needs the 'punkt_tab' resource; see nltk.download('punkt_tab').
    words = word_tokenize(text)
    filtered_words = [word for word in words if word.lower() not in stop_words]
    return ' '.join(filtered_words)


review_df['Review'] = review_df['Review'].apply(removing_stopwords)
review_df.head(3)


LookupError: 
**********************************************************************
  Resource punkt_tab not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt_tab')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt_tab/english/

  Searched in:
    - 'C:\\Users\\saisab31/nltk_data'
    - 'c:\\Users\\saisab31\\Desktop\\saisab\\ir\\venv\\nltk_data'
    - 'c:\\Users\\saisab31\\Desktop\\saisab\\ir\\venv\\share\\nltk_data'
    - 'c:\\Users\\saisab31\\Desktop\\saisab\\ir\\venv\\lib\\nltk_data'
    - 'C:\\Users\\saisab31\\AppData\\Roaming\\nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
**********************************************************************
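
The `LookupError` above comes from `word_tokenize`, which depends on a downloaded tokenizer resource. Where that download is not possible, stopword removal can be approximated without NLTK. The sketch below is an assumption-based illustration: `BASIC_STOPWORDS` is a hypothetical, heavily abbreviated list and `remove_stopwords_simple` is not part of the notebook.

```python
import re

# Hypothetical, abbreviated stopword list; NLTK's English list is much longer.
BASIC_STOPWORDS = {"the", "is", "are", "and", "on", "in", "to", "a", "an", "of", "for"}

def remove_stopwords_simple(text, stop_words=BASIC_STOPWORDS):
    """Tokenize with a regex and drop tokens found in `stop_words`."""
    tokens = re.findall(r"[A-Za-z0-9']+", text)
    return " ".join(t for t in tokens if t.lower() not in stop_words)

print(remove_stopwords_simple("the cast and crew coordination has been flawless"))
# cast crew coordination has been flawless
```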


#### REMOVE EMOJI

In [None]:


def remove_emojis(text):
    emoji_pattern = re.compile("["
                               "\U0001F600-\U0001F64F"  # Emoticons
                               "\U0001F300-\U0001F5FF"  # Symbols & Pictographs
                               "\U0001F680-\U0001F6FF"  # Transport & Map Symbols
                               "\U0001F700-\U0001F77F"  # Alchemical Symbols
                               "\U0001F780-\U0001F7FF"  # Geometric Shapes Extended
                               "\U0001F800-\U0001F8FF"  # Supplemental Arrows-C
                               "\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
                               "\U0001FA00-\U0001FA6F"  # Chess Symbols
                               "\U0001FA70-\U0001FAFF"  # Symbols and Pictographs Extended-A
                               "\U0001FB00-\U0001FBFF"  # Symbols for Legacy Computing
                               "\U0001F004-\U0001F0CF"  # Mahjong tiles, dominoes, playing cards
                               "\U0001F10D-\U0001F10F"  # Enclosed Alphanumeric Supplement
                               "\U0001F200-\U0001F251"  # Enclosed Ideographic Supplement
                               "\u2600-\u26FF"          # Miscellaneous Symbols (includes ☕)
                               "\u2700-\u27BF"          # Dingbats
                               "\uFE0F"                 # Variation Selector-16 that can trail an emoji
                               "]+")
    return emoji_pattern.sub('', text)


review_df['Review'] = review_df['Review'].apply(remove_emojis)



In [136]:
review_df['Review'][6]

'cafeteria food needs improvement '
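
An alternative to listing code-point ranges by hand is to filter on the Unicode general category. The sketch below is an illustration under stated assumptions, not the notebook's method: `remove_symbols` is a hypothetical helper that drops every character in category `So` (Symbol, other). That category covers most emoji, but it also covers signs such as © and °, so it may be too aggressive for some corpora.

```python
import unicodedata

def remove_symbols(text):
    """Drop characters in Unicode category 'So' plus emoji joiners and selectors."""
    kept = []
    for ch in text:
        if unicodedata.category(ch) == "So":   # "Symbol, other": most emoji
            continue
        if ch in ("\u200d", "\ufe0f"):         # zero-width joiner, variation selector-16
            continue
        kept.append(ch)
    return "".join(kept)

print(repr(remove_symbols("Cafeteria food needs an upgrade. \U0001F355\u2615")))
# 'Cafeteria food needs an upgrade. '
```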

#### REMOVE EMOTICONS

In [137]:
EMOTICONS = {
    u":‑\)":"Happy face or smiley",
    u":\)":"Happy face or smiley",
    u":-\]":"Happy face or smiley",
    u":\]":"Happy face or smiley",
    u":-3":"Happy face smiley",
    u":3":"Happy face smiley",
    u":->":"Happy face smiley",
    u":>":"Happy face smiley",
    u"8-\)":"Happy face smiley",
    u":o\)":"Happy face smiley",
    u":-\}":"Happy face smiley",
    u":\}":"Happy face smiley",
    u":-\)":"Happy face smiley",
    u":c\)":"Happy face smiley",
    u":\^\)":"Happy face smiley",
    u"=\]":"Happy face smiley",
    u"=\)":"Happy face smiley",
    u":‑D":"Laughing, big grin or laugh with glasses",
    u":D":"Laughing, big grin or laugh with glasses",
    u"8‑D":"Laughing, big grin or laugh with glasses",
    u"8D":"Laughing, big grin or laugh with glasses",
    u"X‑D":"Laughing, big grin or laugh with glasses",
    u"XD":"Laughing, big grin or laugh with glasses",
    u"=D":"Laughing, big grin or laugh with glasses",
    u"=3":"Laughing, big grin or laugh with glasses",
    u"B\^D":"Laughing, big grin or laugh with glasses",
    u":-\)\)":"Very happy",
    u":‑\(":"Frown, sad, angry or pouting",
    u":-\(":"Frown, sad, angry or pouting",
    u":\(":"Frown, sad, angry or pouting",
    u":‑c":"Frown, sad, angry or pouting",
    u":c":"Frown, sad, angry or pouting",
    u":‑<":"Frown, sad, angry or pouting",
    u":<":"Frown, sad, angry or pouting",
    u":‑\[":"Frown, sad, angry or pouting",
    u":\[":"Frown, sad, angry or pouting",
    u":-\|\|":"Frown, sad, angry or pouting",
    u">:\[":"Frown, sad, angry or pouting",
    u":\{":"Frown, sad, angry or pouting",
    u":@":"Frown, sad, angry or pouting",
    u">:\(":"Frown, sad, angry or pouting",
    u":'‑\(":"Crying",
    u":'\(":"Crying",
    u":'‑\)":"Tears of happiness",
    u":'\)":"Tears of happiness",
    u"D‑':":"Horror",
    u"D:<":"Disgust",
    u"D:":"Sadness",
    u"D8":"Great dismay",
    u"D;":"Great dismay",
    u"D=":"Great dismay",
    u"DX":"Great dismay",
    u":‑O":"Surprise",
    u":O":"Surprise",
    u":‑o":"Surprise",
    u":o":"Surprise",
    u":-0":"Shock",
    u"8‑0":"Yawn",
    u">:O":"Yawn",
    u":-\*":"Kiss",
    u":\*":"Kiss",
    u":X":"Kiss",
    u";‑\)":"Wink or smirk",
    u";\)":"Wink or smirk",
    u"\*-\)":"Wink or smirk",
    u"\*\)":"Wink or smirk",
    u";‑\]":"Wink or smirk",
    u";\]":"Wink or smirk",
    u";\^\)":"Wink or smirk",
    u":‑,":"Wink or smirk",
    u";D":"Wink or smirk",
    u":‑P":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u":P":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u"X‑P":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u"XP":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u":‑Þ":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u":Þ":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u":b":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u"d:":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u"=p":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u">:P":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u":‑/":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u":/":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u":-[.]":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u">:[(\\\)]":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u">:/":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u":[(\\\)]":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u"=/":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u"=[(\\\)]":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u":L":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u"=L":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u":S":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u":‑\|":"Straight face",
    u":\|":"Straight face",
    u":\$":"Embarrassed or blushing",
    u":‑x":"Sealed lips or wearing braces or tongue-tied",
    u":x":"Sealed lips or wearing braces or tongue-tied",
    u":‑#":"Sealed lips or wearing braces or tongue-tied",
    u":#":"Sealed lips or wearing braces or tongue-tied",
    u":‑&":"Sealed lips or wearing braces or tongue-tied",
    u":&":"Sealed lips or wearing braces or tongue-tied",
    u"O:‑\)":"Angel, saint or innocent",
    u"O:\)":"Angel, saint or innocent",
    u"0:‑3":"Angel, saint or innocent",
    u"0:3":"Angel, saint or innocent",
    u"0:‑\)":"Angel, saint or innocent",
    u"0:\)":"Angel, saint or innocent",
    u":‑b":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u"0;\^\)":"Angel, saint or innocent",
    u">:‑\)":"Evil or devilish",
    u">:\)":"Evil or devilish",
    u"\}:‑\)":"Evil or devilish",
    u"\}:\)":"Evil or devilish",
    u"3:‑\)":"Evil or devilish",
    u"3:\)":"Evil or devilish",
    u">;\)":"Evil or devilish",
    u"\|;‑\)":"Cool",
    u"\|‑O":"Bored",
    u":‑J":"Tongue-in-cheek",
    u"#‑\)":"Party all night",
    u"%‑\)":"Drunk or confused",
    u"%\)":"Drunk or confused",
    u":-###\.\.":"Being sick",
    u":###\.\.":"Being sick",
    u"<:‑\|":"Dump",
    u"\(>_<\)":"Troubled",
    u"\(>_<\)>":"Troubled",
    u"\(';'\)":"Baby",
    u"\(\^\^>``":"Nervous or Embarrassed or Troubled or Shy or Sweat drop",
    u"\(\^_\^;\)":"Nervous or Embarrassed or Troubled or Shy or Sweat drop",
    u"\(-_-;\)":"Nervous or Embarrassed or Troubled or Shy or Sweat drop",
    u"\(~_~;\) \(・\.・;\)":"Nervous or Embarrassed or Troubled or Shy or Sweat drop",
    u"\(-_-\)zzz":"Sleeping",
    u"\(\^_-\)":"Wink",
    u"\(\(\+_\+\)\)":"Confused",
    u"\(\+o\+\)":"Confused",
    u"\(o\|o\)":"Ultraman",
    u"\^_\^":"Joyful",
    u"\(\^_\^\)/":"Joyful",
    u"\(\^O\^\)／":"Joyful",
    u"\(\^o\^\)／":"Joyful",
    u"\(__\)":"Kowtow as a sign of respect, or dogeza for apology",
    u"_\(\._\.\)_":"Kowtow as a sign of respect, or dogeza for apology",
    u"<\(_ _\)>":"Kowtow as a sign of respect, or dogeza for apology",
    u"<m\(__\)m>":"Kowtow as a sign of respect, or dogeza for apology",
    u"m\(__\)m":"Kowtow as a sign of respect, or dogeza for apology",
    u"m\(_ _\)m":"Kowtow as a sign of respect, or dogeza for apology",
    u"\('_'\)":"Sad or Crying",
    u"\(/_;\)":"Sad or Crying",
    u"\(T_T\) \(;_;\)":"Sad or Crying",
    u"\(;_;":"Sad or Crying",
    u"\(;_:\)":"Sad or Crying",
    u"\(;O;\)":"Sad or Crying",
    u"\(:_;\)":"Sad or Crying",
    u"\(ToT\)":"Sad or Crying",
    u";_;":"Sad or Crying",
    u";-;":"Sad or Crying",
    u";n;":"Sad or Crying",
    u";;":"Sad or Crying",
    u"Q\.Q":"Sad or Crying",
    u"T\.T":"Sad or Crying",
    u"QQ":"Sad or Crying",
    u"Q_Q":"Sad or Crying",
    u"\(-\.-\)":"Shame",
    u"\(-_-\)":"Shame",
    u"\(一一\)":"Shame",
    u"\(；一_一\)":"Shame",
    u"\(=_=\)":"Tired",
    u"\(=\^\·\^=\)":"cat",
    u"\(=\^\·\·\^=\)":"cat",
    u"=_\^=":"cat",
    u"\(\.\.\)":"Looking down",
    u"\(\._\.\)":"Looking down",
    u"\^m\^":"Giggling with hand covering mouth",
    u"\(・・\?":"Confusion",
    u"\(\?_\?\)":"Confusion",
    u">\^_\^<":"Normal Laugh",
    u"<\^!\^>":"Normal Laugh",
    u"\^/\^":"Normal Laugh",
    u"\（\*\^_\^\*）" :"Normal Laugh",
    u"\(\^<\^\) \(\^\.\^\)":"Normal Laugh",
    u"\(^\^\)":"Normal Laugh",
    u"\(\^\.\^\)":"Normal Laugh",
    u"\(\^_\^\.\)":"Normal Laugh",
    u"\(\^_\^\)":"Normal Laugh",
    u"\(\^\^\)":"Normal Laugh",
    u"\(\^J\^\)":"Normal Laugh",
    u"\(\*\^\.\^\*\)":"Normal Laugh",
    u"\(\^—\^\）":"Normal Laugh",
    u"\(#\^\.\^#\)":"Normal Laugh",
    u"\（\^—\^\）":"Waving",
    u"\(;_;\)/~~~":"Waving",
    u"\(\^\.\^\)/~~~":"Waving",
    u"\(-_-\)/~~~ \($\·\·\)/~~~":"Waving",
    u"\(T_T\)/~~~":"Waving",
    u"\(ToT\)/~~~":"Waving",
    u"\(\*\^0\^\*\)":"Excited",
    u"\(\*_\*\)":"Amazed",
    u"\(\*_\*;":"Amazed",
    u"\(\+_\+\) \(@_@\)":"Amazed",
    u"\(\*\^\^\)v":"Laughing,Cheerful",
    u"\(\^_\^\)v":"Laughing,Cheerful",
    u"\(\(d[-_-]b\)\)":"Headphones,Listening to music",
    u'\(-"-\)':"Worried",
    u"\(ーー;\)":"Worried",
    u"\(\^0_0\^\)":"Eyeglasses",
    u"\(\＾ｖ\＾\)":"Happy",
    u"\(\＾ｕ\＾\)":"Happy",
    u"\(\^\)o\(\^\)":"Happy",
    u"\(\^O\^\)":"Happy",
    u"\(\^o\^\)":"Happy",
    u"\)\^o\^\(":"Happy",
    u":O o_O":"Surprised",
    u"o_0":"Surprised",
    u"o\.O":"Surprised",
    u"\(o\.o\)":"Surprised",
    u"oO":"Surprised",
    u"\(\*￣m￣\)":"Dissatisfied",
    u"\(‘A`\)":"Snubbed or Deflated"
}

In [138]:
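def remove_emoticons(text):
    # Try longer patterns first so that a short emoticon does not match a prefix of a longer one
    patterns = sorted(EMOTICONS, key=len, reverse=True)
    emoticon_pattern = re.compile('(' + '|'.join(patterns) + ')')
    return emoticon_pattern.sub('', text)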

review_df['Review'] = review_df['Review'].apply(remove_emoticons)



In [139]:
review_df['Review'][6]

'cafeteria food needs improvement '
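
Maintaining hand-escaped regex keys is error-prone, because a single unescaped `$`, `?`, or `.` changes what the pattern matches. One way to avoid this, sketched here as an assumption rather than as the notebook's approach, is to store emoticons as literal strings and let `re.escape` handle the escaping. `EMOTICON_LIST` is a hypothetical, very short list.

```python
import re

# Hypothetical short list of literal emoticons (no manual regex escaping needed).
EMOTICON_LIST = [":)", ":-)", ":-))", ":(", ":[", ";_;", ":$"]

# Escape each literal and try longer emoticons first so ":-))" wins over ":-)".
_EMOTICON_RE = re.compile(
    "|".join(re.escape(e) for e in sorted(EMOTICON_LIST, key=len, reverse=True))
)

def strip_emoticons(text):
    """Remove every listed emoticon from `text`."""
    return _EMOTICON_RE.sub("", text)

print(repr(strip_emoticons("please fix it asap! ;_;")))
# 'please fix it asap! '
print(repr(strip_emoticons("great :-)) thanks")))
# 'great  thanks'
```

Emoticons consist of punctuation characters, so this step only has an effect if it runs before punctuation is removed.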

#### TOKENIZATION

There are many ways to implement tokenization.

In [140]:
normal_text = "AI is getting smarter and can now make new things like art and stories."
normal_para = "AI is getting smarter and can now make new things like art and stories. This changes how people create and design."

##### Using the split function 

In [141]:
# word tokenization
tokenize1 = normal_text.split()
tokenize1

['AI',
 'is',
 'getting',
 'smarter',
 'and',
 'can',
 'now',
 'make',
 'new',
 'things',
 'like',
 'art',
 'and',
 'stories.']

In [142]:
# sentence tokenization
tokenize2 = normal_para.split(".")
tokenize2

['AI is getting smarter and can now make new things like art and stories',
 ' This changes how people create and design',
 '']

##### Using regular expression

In [143]:
import re
tokenize3 = re.findall(r"[\w']+", normal_text)
tokenize3

['AI',
 'is',
 'getting',
 'smarter',
 'and',
 'can',
 'now',
 'make',
 'new',
 'things',
 'like',
 'art',
 'and',
 'stories']
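
Regular expressions can also handle sentence tokenization. Splitting on `"."` discards the period and leaves an empty trailing string, as seen earlier. A lookbehind avoids both problems. This is a minimal sketch, and `split_sentences` is a hypothetical helper. It will still split incorrectly after abbreviations such as "Dr.".

```python
import re

def split_sentences(paragraph):
    """Split after '.', '!' or '?' when followed by whitespace; keep the punctuation."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [p for p in parts if p]

print(split_sentences("AI is getting smarter. This changes how people create and design."))
# ['AI is getting smarter.', 'This changes how people create and design.']
```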

##### Using NLTK

In [144]:
from nltk.tokenize import word_tokenize,sent_tokenize


In [145]:
word_tokenize(normal_text)

['AI',
 'is',
 'getting',
 'smarter',
 'and',
 'can',
 'now',
 'make',
 'new',
 'things',
 'like',
 'art',
 'and',
 'stories',
 '.']

In [146]:
sent_tokenize(normal_para)

['AI is getting smarter and can now make new things like art and stories.',
 'This changes how people create and design.']

##### Using spaCy

In [165]:
# !python -m spacy download en_core_web_sm

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')
tokenize4 = nlp(normal_text)
tokenize4

In [None]:
for token in tokenize4:
    print(token)

In [None]:
def spacy_tokenize(text):
    # Reuse the `nlp` pipeline loaded above; reloading the model on every call is slow
    return nlp(text)

review_df['Review'] = review_df['Review'].apply(spacy_tokenize)


In [161]:
review_df.head(3)

Unnamed: 0,Review
0,great company work techlife
1,found table tennis board brokencan please fix sap
2,worklife balance terrible


#### STEMMING

Stemming reduces inflected words to a common root form (the stem) by stripping affixes. A group of related words maps to the same stem even if that stem is not itself a valid word in the language.

Here we tokenize each review again, stem every token, collect the stems with a list comprehension, and join them back into a single string so the result can be stored in the DataFrame.

In [None]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
nlp = spacy.load('en_core_web_sm')

def apply_stemming(text):
    
    tokenize_value = nlp(text)
    
    stemmed_words =  [stemmer.stem(token.text) for token in tokenize_value]
    stemmed_text = ' '.join(stemmed_words)
    return stemmed_text

review_df['Review'] = review_df['Review'].apply(apply_stemming)


In [163]:
review_df.head(3)

Unnamed: 0,Review
0,great company work techlife
1,found table tennis board brokencan please fix sap
2,worklife balance terrible
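
To make the idea of a stem concrete, here is a deliberately naive suffix stripper. It is not the Porter algorithm used above, which applies a much richer set of rules. `SUFFIXES` and `naive_stem` are hypothetical names chosen for this sketch.

```python
# Suffixes are checked in order; longer ones come first so "ing" is tried before "s".
SUFFIXES = ("ingly", "edly", "ing", "ed", "ly", "es", "s")

def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix if the remaining stem is long enough."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

print([naive_stem(w) for w in ["crashing", "crashed", "crashes", "deadlines", "is"]])
# ['crash', 'crash', 'crash', 'deadlin', 'is']
```

The result `deadlin` shows the property described earlier: a stem groups related word forms but does not have to be a valid word.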


#### LEMMATIZATION

Lemmatization, unlike stemming, reduces inflected words in a way that ensures the root word belongs to the language. In lemmatization, the root word is called the lemma. A lemma (plural lemmas or lemmata) is the canonical form, dictionary form, or citation form of a set of words.
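
The difference from stemming can be sketched with a simple dictionary lookup. Real lemmatizers rely on a full lexicon and usually on part-of-speech information. The `LEMMA_TABLE` below is a hypothetical toy table and `lemmatize_tokens` is an illustrative helper, not a library function.

```python
# Hypothetical lemma table; real lemmatizers use a full dictionary plus part-of-speech tags.
LEMMA_TABLE = {
    "was": "be",
    "is": "be",
    "better": "good",
    "films": "film",
    "studies": "study",
}

def lemmatize_tokens(tokens):
    """Map each token to its dictionary form, leaving unknown tokens unchanged."""
    return [LEMMA_TABLE.get(t.lower(), t) for t in tokens]

print(lemmatize_tokens(["The", "films", "studies", "was", "better"]))
# ['The', 'film', 'study', 'be', 'good']
```

Every output here is a valid dictionary word, including irregular forms such as "was" and "better" that suffix stripping cannot handle.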

The final result after all of these text preprocessing steps is:

In [164]:
pd.set_option('display.max_colwidth', None)
review_df

Unnamed: 0,Review
0,great company work techlife
1,found table tennis board brokencan please fix sap
2,worklife balance terrible
3,love new project amazing innovation
4,team suppurative
5,human resource department responsive helpful
6,cafeteria food needs improvement
7,meetings deadline challenging
8,company culture fantastic
9,company software dvelpmnt process quite efficient productive ️
