# Energy NW Text Cleaning

### This notebook cleans the Energy NW text data in the CONCAT_TEXT_FOR_WKS column, applying the same normalization steps to every row. The cleaned DataFrame will be published on GitHub.

#### Import necessary libraries/tools

In [2]:
#!pip install spacy
#!pip install gensim
#!pip install spacy && python3 -m spacy download en_core_web_sm

In [12]:
# Standard library
import collections
import re
import time
import unicodedata

# Third-party
import en_core_web_sm
import nltk
import numpy as np
import pandas as pd
import spacy
from bs4 import BeautifulSoup
from nltk.corpus import wordnet
from nltk.tokenize.toktok import ToktokTokenizer
#from textblob import Word

# Local module providing normalize_corpus
import text_normalizer as tn

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /Users/Jewel.Matsch-
[nltk_data]     Rowekamp@ibm.com/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

### Load the Dataset

In [5]:
raw_df = pd.read_csv('Full_Clean.csv')
raw_df.head(10)

Unnamed: 0,AR_NUMBER,AR_OWED_TO_GRP,AR_TYPE,ORIGINATION_DATE,AR_PRIORITY,AR_SEVERITY,AR_SUBJECT,WORK_REQ_QTY,TSSSC,NON_TSSSC,DNC,COMP_MEASURE,QUALITY,EQUIP_REL,DESCRIPTION_NOTES,AR_EPN,PRIORITY_SEVERITY,CONCAT_TEXT_FOR_WKS
0,383472,AR-OPSWM,CR,20180815,CAQ,C,RCIC and HPCS Low CST level Swap-over.,2.0,Y,Y,Y,,N,YES,<D> **** Detailed Description **** During fill...,"02-00-HPCS-LS-1A,02-00-RCIC-LS-15A",CAQ:C,RCIC and HPCS Low CST level Swap-over.: Duri...
1,383473,CR-TRNDREP,CR,20180815,CAQ,D,RCIC-P-3 will not start in RUN,1.0,Y,Y,Y,,N,YES,<C> Item is a CRBB for Maintenance to gather a...,02-00-RCIC-P-3,CAQ:D,RCIC-P-3 will not start in RUN: SOP-RCIC-FIL...
2,383474,CR-TRNDREP,CR,20180815,CAQ,D,H13-P601 A1 drop 5-6 HPCS-SUCTION SWITCHOVER a...,1.0,Y,Y,Y,,N,YES,<D> **** Detailed Description **** H13-P601 A1...,02-00-HPCS-LS-1A,CAQ:D,H13-P601 A1 drop 5-6 HPCS-SUCTION SWITCHOVER a...
3,383476,CR-TRNDREP,CR,20180815,NCAQ,4,BRE 36 & 37,1.0,N,N,,,N,YES,<D> **** Detailed Description **** Both Bre's ...,,NCAQ:4,BRE 36 & 37: Both Bre's need to have their w...
4,383477,CR-CRG,CR,20180815,CAQ,D,Received the ROD DRIVE CONTROL SYS INOP alarm,,Y,Y,Y,,N,YES,<D> **** Detailed Description **** Received th...,,CAQ:D,Received the ROD DRIVE CONTROL SYS INOP alarm:...
5,383478,CR-TRNDREP,CR,20180815,CAQ,D,HPCS SUCTION SWITCHOVER ALARM LOCKED IN,4.0,Y,N,Y,,N,YES,<D> **** Detailed Description **** During fill...,"02-00-HPCS-LS-1A,02-00-HPCS-LS-1B,02-00-HPCS-P...",CAQ:D,HPCS SUCTION SWITCHOVER ALARM LOCKED IN: Dur...
6,383479,AR-OPSCM,CR,20180815,NCAQ,3,"CW-CT-1A Approx 1"" from overflow",,N,N,N,,N,NO,<D> **** Detailed Description **** During oper...,,NCAQ:3,"CW-CT-1A Approx 1"" from overflow: During ope..."
7,383490,CR-TRNDREP,CR,20180815,CAQ,D,Security CCTV33 needs maintenance,1.0,N,N,,,N,YES,<D> **** Detailed Description **** Security CC...,02-00-SEC-CCTV-33,CAQ:D,Security CCTV33 needs maintenance: Security ...
8,383491,AR-OPSMGT,CR,20180815,CAQ,C,SW-V-16A and 22A changed to LO without 50.59 s...,,Y,N,Y,,N,YES,<C> Reviewed ANON CR for content per SWP-CAP-0...,"02-00-SW-V-16A,02-00-SW-V-22A",CAQ:C,SW-V-16A and 22A changed to LO without 50.59 s...
9,383497,ITMGMT,CR,20180815,NCAQ,3,EP Copiers unable to Scan to Email,,,,,,N,NO,<D> **** Detailed Description **** New Canon C...,,NCAQ:3,EP Copiers unable to Scan to Email: New Cano...


In [6]:
raw_df.columns

Index(['AR_NUMBER', 'AR_OWED_TO_GRP', 'AR_TYPE', 'ORIGINATION_DATE',
       'AR_PRIORITY', 'AR_SEVERITY', 'AR_SUBJECT', 'WORK_REQ_QTY', 'TSSSC',
       'NON_TSSSC', 'DNC', 'COMP_MEASURE', 'QUALITY', 'EQUIP_REL',
       'DESCRIPTION_NOTES', 'AR_EPN', 'PRIORITY_SEVERITY',
       'CONCAT_TEXT_FOR_WKS'],
      dtype='object')

In [7]:
raw_df.shape

(11337, 18)

#### Select AR_NUMBER, Priority/Severity, and Text Columns

In [22]:
data_selection = raw_df.iloc[:,[0, 4, 5, 16, 17]]
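The positional selection above depends on column order, and assigning into the result later is what produces the "set on a copy of a slice" warning shown in the cleaning step. A minimal sketch of one way to avoid both, assuming a small made-up stand-in for `raw_df` (the rows below are hypothetical, not taken from `Full_Clean.csv`): select columns by name and take an explicit `.copy()`.

```python
import pandas as pd

# Hypothetical miniature stand-in for raw_df; values are made up for illustration.
raw_df = pd.DataFrame({
    "AR_NUMBER": [1, 2],
    "AR_SUBJECT": ["Pump alarm", "Door latch"],
    "CONCAT_TEXT_FOR_WKS": ["Pump alarm: Pump tripped.", "Door latch: Latch sticking."],
})

# Select by column name (robust to reordering) and copy explicitly,
# so later assignments clearly target this new frame and not raw_df.
data_selection = raw_df[["AR_NUMBER", "CONCAT_TEXT_FOR_WKS"]].copy()

# Stand-in for the cleaned text; the notebook uses tn.normalize_corpus here.
cleaned = data_selection["CONCAT_TEXT_FOR_WKS"].str.lower()
data_selection["CONCAT_TEXT_FOR_WKS"] = cleaned

print(data_selection)
print(raw_df["CONCAT_TEXT_FOR_WKS"].tolist())  # original text is untouched
```

With the explicit copy, the raw text stays available in `raw_df` for comparison against the cleaned column.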

In [26]:
data_selection.tail(5)

Unnamed: 0,AR_NUMBER,AR_PRIORITY,AR_SEVERITY,PRIORITY_SEVERITY,CONCAT_TEXT_FOR_WKS
11332,402851,NCAQ,4,NCAQ:4,p door b door latch sticking perform fire door...
11333,402858,NCAQ,4,NCAQ:4,conference call drop weekend production call w...
11334,402859,NCAQ,4,NCAQ:4,pwc p slow leak pump suction fit bermed area p...
11335,402860,CAQ,D,CAQ:D,filter aa end roll alarm alarm p filter aa end...
11336,402861,NCAQ,3,NCAQ:3,drum rw cause alara safety concern currently l...


### Clean Text Column

In [24]:
#Track how long this step takes.
start = time.time()

#Use the English stopword list from the Natural Language Toolkit (NLTK).
stopword_list = nltk.corpus.stopwords.words('english')

#Remove negation words from the stopword list so they are kept in the corpus.
stopword_list.remove('no')
stopword_list.remove('not')

#Call normalize_corpus on the CONCAT_TEXT_FOR_WKS column of the selected DataFrame.
norm_corpus = tn.normalize_corpus(corpus=data_selection['CONCAT_TEXT_FOR_WKS'], html_stripping=True, contraction_expansion=True, 
                                  accented_char_removal=True, text_lower_case=True, text_lemmatization=True, 
                                  text_stemming=True, special_char_removal=True, remove_digits=True,
                                  stopword_removal=True, stopwords=stopword_list)

#Overwrite the CONCAT_TEXT_FOR_WKS column of data_selection with the cleaned text.
data_selection['CONCAT_TEXT_FOR_WKS'] = norm_corpus

#Calculate how long the cleaning took, in minutes.
end = time.time()
minutes = (end - start)/60

print('It took', minutes, ' minutes to clean the text data.')

It took 2.7521117170651754  minutes to clean the text data.


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
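To illustrate why 'no' and 'not' are removed from the stopword list before cleaning, here is a simplified sketch of a few of the normalization steps (accent removal, lowercasing, dropping digits and special characters, stopword filtering). It is not the `text_normalizer` implementation: the `normalize` function and the tiny `STOPWORDS` set are hypothetical stand-ins, and lemmatization, HTML stripping, and contraction expansion are left out.

```python
import re
import unicodedata

# Hypothetical, tiny stopword set for illustration only;
# the notebook uses NLTK's full English list.
STOPWORDS = {"the", "a", "is", "in", "will", "to", "no", "not"}


def normalize(text, stopwords):
    """Simplified stand-in for a text normalization pipeline."""
    # Drop accents: decompose characters, then discard non-ASCII bytes.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    text = text.lower()
    # Replace digits, punctuation, and other non-letters with spaces.
    text = re.sub(r"[^a-z\s]", " ", text)
    # Split on whitespace and filter out stopwords.
    tokens = [tok for tok in text.split() if tok not in stopwords]
    return " ".join(tokens)


subject = "RCIC-P-3 will not start in RUN"

# Negations kept (as in the notebook) versus treated as stopwords.
with_negation = normalize(subject, STOPWORDS - {"no", "not"})
without_negation = normalize(subject, STOPWORDS)

print(with_negation)     # rcic p not start run
print(without_negation)  # rcic p start run
```

Without the negation words, a report that a pump will not start becomes indistinguishable from one saying that it started, which is why they are kept in the corpus.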


### Display Results of Cleaned Text Column

In [27]:
data_selection.tail(5)

Unnamed: 0,AR_NUMBER,AR_PRIORITY,AR_SEVERITY,PRIORITY_SEVERITY,CONCAT_TEXT_FOR_WKS
11332,402851,NCAQ,4,NCAQ:4,p door b door latch sticking perform fire door...
11333,402858,NCAQ,4,NCAQ:4,conference call drop weekend production call w...
11334,402859,NCAQ,4,NCAQ:4,pwc p slow leak pump suction fit bermed area p...
11335,402860,CAQ,D,CAQ:D,filter aa end roll alarm alarm p filter aa end...
11336,402861,NCAQ,3,NCAQ:3,drum rw cause alara safety concern currently l...


In [28]:
data_selection.shape

(11337, 5)

In [34]:
#Bind a more descriptive name; this refers to the same DataFrame object, not a copy.
energy_nw_clean_dataframe = data_selection

### Save DataFrame

In [35]:
energy_nw_clean_dataframe

Unnamed: 0,AR_NUMBER,AR_PRIORITY,AR_SEVERITY,PRIORITY_SEVERITY,CONCAT_TEXT_FOR_WKS
0,383472,CAQ,C,CAQ:C,rcic hpcs low cst level swap fill vent suction...
1,383473,CAQ,D,CAQ:D,rcic p not start run sop rcic fill direct star...
2,383474,CAQ,D,CAQ:D,hp drop hpcs suction switchover alarm hp drop ...
3,383476,NCAQ,4,NCAQ:4,bre bre need window evaluate bre
4,383477,CAQ,D,CAQ:D,receive rod drive control sys inop alarm recei...
...,...,...,...,...,...
11332,402851,NCAQ,4,NCAQ:4,p door b door latch sticking perform fire door...
11333,402858,NCAQ,4,NCAQ:4,conference call drop weekend production call w...
11334,402859,NCAQ,4,NCAQ:4,pwc p slow leak pump suction fit bermed area p...
11335,402860,CAQ,D,CAQ:D,filter aa end roll alarm alarm p filter aa end...


In [36]:
energy_nw_clean_dataframe.to_csv('energy_nw_clean_dataframe.csv', index=False)
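A quick round-trip check can confirm what the saved file will look like when it is read back. The sketch below uses an in-memory buffer instead of a file and a two-row stand-in DataFrame; the second row (with an empty cleaned text) is hypothetical and only illustrates one thing to watch for: by default `pd.read_csv` loads empty fields as NaN, so any row whose text was reduced to an empty string would come back as a missing value.

```python
import io

import pandas as pd

# Stand-in for the cleaned DataFrame; the second row is hypothetical.
clean_df = pd.DataFrame({
    "AR_NUMBER": [383476, 2],
    "CONCAT_TEXT_FOR_WKS": ["bre bre need window evaluate bre", ""],
})

# Write without the index, as in the notebook, then read the result back.
buffer = io.StringIO()
clean_df.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# index=False means no extra "Unnamed: 0" column appears on reload.
print(list(reloaded.columns))

# The empty string was written as an empty field and comes back as NaN.
missing = reloaded["CONCAT_TEXT_FOR_WKS"].isna().tolist()
print(missing)

# One option: restore empty strings before any downstream text processing.
restored = reloaded["CONCAT_TEXT_FOR_WKS"].fillna("").tolist()
print(restored)
```

Whether empty rows occur in the real data is not shown in this notebook, so this is only a precaution to consider before vectorizing the reloaded text.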