## Text preprocessing using spaCy

Aim: To preprocess the text columns of the Women's Clothing E-Commerce Reviews dataset using spaCy.

Description: spaCy is a free, open-source Python library for natural language processing (NLP), designed to process large volumes of text at high speed. It provides tokenization, lemmatization, part-of-speech tagging, and named-entity recognition, which can underpin document analysis, chatbots, and other text-analysis applications.

In [None]:
! pip install unidecode

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting unidecode
  Downloading Unidecode-1.3.6-py3-none-any.whl (235 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 235.9/235.9 kB 4.3 MB/s eta 0:00:00
Installing collected packages: unidecode
Successfully installed unidecode-1.3.6


In [None]:
# Install spaCy, then fetch the small English model with spaCy's own downloader.
# ("pip install spacy download en_core_web_sm" would instead install an unrelated
# PyPI package named "download" and never fetch the model.)
!pip install spacy
!python -m spacy download en_core_web_sm
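Because the language model is a separate package from the spaCy library, it can help to confirm that it is installed before calling `spacy.load`. The following is a minimal sketch, assuming spaCy itself is already installed; `MODEL_NAME` and `model_installed` are illustrative names, not part of the original notebook.

```python
import spacy
import spacy.util

MODEL_NAME = "en_core_web_sm"  # the model this notebook loads later

# is_package() reports whether a model package with this name is installed
model_installed = spacy.util.is_package(MODEL_NAME)

if model_installed:
    print(f"{MODEL_NAME} is installed")
else:
    print(f"{MODEL_NAME} is missing; run: python -m spacy download {MODEL_NAME}")
```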


In [None]:
import pandas as pd
import spacy
import matplotlib.pyplot as plt
from collections import Counter
import re
from sklearn.model_selection import train_test_split
from spacy.language import Language
from imblearn.over_sampling import SMOTE

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import pandas as pd
data = pd.read_csv('/content/drive/MyDrive/NLP/Womens Clothing E-Commerce Reviews.csv')

In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23486 entries, 0 to 23485
Data columns (total 11 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Unnamed: 0               23486 non-null  int64 
 1   Clothing ID              23486 non-null  int64 
 2   Age                      23486 non-null  int64 
 3   Title                    19676 non-null  object
 4   Review Text              22641 non-null  object
 5   Rating                   23486 non-null  int64 
 6   Recommended IND          23486 non-null  int64 
 7   Positive Feedback Count  23486 non-null  int64 
 8   Division Name            23472 non-null  object
 9   Department Name          23472 non-null  object
 10  Class Name               23472 non-null  object
dtypes: int64(6), object(5)
memory usage: 2.0+ MB


In [None]:
data

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
3,3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants
4,4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses
...,...,...,...,...,...,...,...,...,...,...,...
23481,23481,1104,34,Great dress for many occasions,I was very happy to snag this dress at such a ...,5,1,0,General Petite,Dresses,Dresses
23482,23482,862,48,Wish it was made of cotton,"It reminds me of maternity clothes. soft, stre...",3,1,0,General Petite,Tops,Knits
23483,23483,1104,31,"Cute, but see through","This fit well, but the top was very see throug...",3,0,1,General Petite,Dresses,Dresses
23484,23484,1084,28,"Very cute dress, perfect for summer parties an...",I bought this dress for a wedding i have this ...,3,1,2,General,Dresses,Dresses


In [None]:
data.isna().any()

Unnamed: 0                 False
Clothing ID                False
Age                        False
Title                       True
Review Text                 True
Rating                     False
Recommended IND            False
Positive Feedback Count    False
Division Name               True
Department Name             True
Class Name                  True
dtype: bool
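The `data.info()` output above gives the scale of the gaps: `Title` has 19,676 non-null values out of 23,486 (3,810 missing) and `Review Text` has 22,641 (845 missing). The sketch below shows, on a made-up two-row frame (`toy` is a hypothetical stand-in, not the real dataset), how to count missing cells per column and why they matter for text cleaning: converting a missing value with `str()` produces the literal text "nan".

```python
import pandas as pd

# Toy frame standing in for the review data (values are made up for illustration)
toy = pd.DataFrame({
    "Title": ["Great fit", float("nan")],
    "Review Text": ["Runs small", "Love it"],
})

# Number of missing cells per column
missing_counts = toy.isna().sum()
print(missing_counts)

# str() turns a missing value into the literal text "nan"
print(repr(str(toy.loc[1, "Title"])))

# Replacing missing values with an empty string first avoids that
filled = toy["Title"].fillna("")
print(filled.tolist())
```

On the real data, `data.isna().sum()` would give the same per-column counts directly.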

In [None]:
import re
import string
import time 
nlp = spacy.load("en_core_web_sm")
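Every token produced by a spaCy pipeline carries lexical flags such as `is_stop`, `is_punct`, `like_url`, and `like_email`. The sketch below illustrates filtering on those flags. As an assumption made for portability, it uses `spacy.blank("en")`, a tokenizer-only English pipeline that needs no model download; it therefore cannot produce lemmas, which require the lemmatizer in `en_core_web_sm`. The sentence and the names `nlp_blank` and `kept` are illustrative.

```python
import spacy

# Tokenizer-only English pipeline: enough for lexical flags, but no lemmatizer
nlp_blank = spacy.blank("en")

doc = nlp_blank("I love this dress! Details at https://example.com or mail me@example.com")

# Keep the text of tokens that are not punctuation, whitespace, URLs, emails, or stop words
kept = [
    token.text
    for token in doc
    if not (token.is_punct or token.is_space or token.like_url
            or token.like_email or token.is_stop)
]
print(kept)
```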

In [None]:
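def spacy_preprocess(text):
    # Missing values arrive as NaN; str(NaN) would otherwise yield the token "nan"
    if pd.isna(text):
        return ""
    text = str(text)
    text = re.sub(r"http\S+", "", text)            # remove URLs
    text = re.sub(r"@\w+", "", text)               # remove @mentions
    text = re.sub(r" [A-Za-z]*\.com", " ", text)   # remove bare domains such as " site.com"
    # Replace runs of characters outside the kept set with one space. "$-," is a range
    # ($ % & ' ( ) * + ,), so apostrophes survive while hyphens and pipes are replaced.
    text = re.sub(r"[^a-zA-Z0-9:$-,%.?!]+", " ", text)
    doc = nlp(text)
    # Keep lemmas of tokens that are not punctuation, whitespace, URLs, emails, or stop words
    tokens = [
        token.lemma_
        for token in doc
        if not (token.is_punct or token.is_space or token.like_url
                or token.like_email or token.is_stop)
    ]
    return " ".join(tokens)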
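To see what the regular-expression stage does before spaCy is involved, the sketch below applies the same substitutions to a made-up sentence using only the standard library. `regex_clean` and `sample` are hypothetical names introduced for illustration. Note in particular that `$-,` inside a character class is a range, so the apostrophe in "it's" is kept while the hyphen and `#` are replaced.

```python
import re

def regex_clean(text: str) -> str:
    """Apply only the regex stage of the cleanup (no spaCy involved)."""
    text = re.sub(r"http\S+", "", text)            # drop URLs
    text = re.sub(r"@\w+", "", text)               # drop @mentions
    text = re.sub(r" [A-Za-z]*\.com", " ", text)   # drop bare "site.com" names
    # "$-," is a range from "$" to ",": the characters $ % & ' ( ) * + and ,
    # (it does not include a literal hyphen)
    text = re.sub(r"[^a-zA-Z0-9:$-,%.?!]+", " ", text)
    return text

sample = "Love it! it's great - see http://example.com or shop.com @store #sale"
cleaned = regex_clean(sample)
print(cleaned)  # Love it! it's great see or sale
```

The remaining punctuation and the stop words are removed later by the token filter, which is why the regex stage can afford to leave characters such as `!` in place.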

In [None]:
# Clean both free-text columns
data['Title'] = data['Title'].apply(spacy_preprocess)
data['Review Text'] = data['Review Text'].apply(spacy_preprocess)

In [None]:
data

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,0,767,33,,absolutely wonderful silky sexy comfortable,4,1,0,Initmates,Intimate,Intimates
1,1,1080,34,,love dress sooo pretty happen find store glad ...,5,1,4,General,Dresses,Dresses
2,2,1077,60,major design flaw,high hope dress want work initially order peti...,3,0,0,General,Dresses,Dresses
3,3,1049,50,favorite buy,love love love jumpsuit fun flirty fabulous ti...,5,1,0,General Petite,Bottoms,Pants
4,4,847,47,flattering shirt,shirt flattering adjustable tie perfect length...,5,1,6,General,Tops,Blouses
...,...,...,...,...,...,...,...,...,...,...,...
23481,23481,1104,34,great dress occasion,happy snag dress great price easy slip flatter...,5,1,0,General Petite,Dresses,Dresses
23482,23482,862,48,wish cotton,remind maternity clothe soft stretchy shiny ma...,3,1,0,General Petite,Tops,Knits
23483,23483,1104,31,cute,fit work glad able try store order online diff...,3,0,1,General Petite,Dresses,Dresses
23484,23484,1084,28,cute dress perfect summer party,buy dress wedding summer cute unfortunately fi...,3,1,2,General,Dresses,Dresses
