# NLP Basics: Reading in text data & why do we need to clean the text?

### Read in semi-structured text data

In [1]:
# Read in the raw text
with open("SMSSpamCollection.tsv") as f:
    rawData = f.read()

# Print the raw data, first 500 characters
rawData[0:500]

"ham\tI've been searching for the right words to thank you for this breather. I promise i wont take your help for granted and will fulfil my promise. You have been wonderful and a blessing at all times.\nspam\tFree entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's\nham\tNah I don't think he goes to usf, he lives around here though\nham\tEven my brother is not like to speak with me. They treat me like aid"

The block of text above uses \t and \n as separators. A \t sits between each label and its message body, and a \n ends each line. To turn this into a list, replace every \t with \n and then split the text on the \n character.

In [3]:
#Converting the block of text to list
parsedData = rawData.replace('\t', '\n').split('\n')
parsedData[0:10]

['ham',
 "I've been searching for the right words to thank you for this breather. I promise i wont take your help for granted and will fulfil my promise. You have been wonderful and a blessing at all times.",
 'spam',
 "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's",
 'ham',
 "Nah I don't think he goes to usf, he lives around here though",
 'ham',
 'Even my brother is not like to speak with me. They treat me like aids patent.',
 'ham',
 'I HAVE A DATE ON SUNDAY WITH WILL!!']

In [5]:
#separating the label/target and the associated text
labelList = parsedData[0::2]
textList = parsedData[1::2]

print(labelList[0:5])
print(textList[0:5])

['ham', 'spam', 'ham', 'ham', 'ham']
["I've been searching for the right words to thank you for this breather. I promise i wont take your help for granted and will fulfil my promise. You have been wonderful and a blessing at all times.", "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's", "Nah I don't think he goes to usf, he lives around here though", 'Even my brother is not like to speak with me. They treat me like aids patent.', 'I HAVE A DATE ON SUNDAY WITH WILL!!']
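The two slices above rely on the labels and messages alternating in the list. As a minimal sketch of that stride slicing, assuming a small hypothetical list `tokens` in place of `parsedData`:

```python
# Hypothetical miniature of parsedData: labels and texts alternate.
tokens = ["ham", "see you at 5", "spam", "WIN a prize now", "ham", "ok"]

labels = tokens[0::2]  # start at index 0, take every second item
texts = tokens[1::2]   # start at index 1, take every second item

print(labels)  # ['ham', 'spam', 'ham']
print(texts)   # ['see you at 5', 'WIN a prize now', 'ok']
```

The two results only pair up one-to-one when the list has an even number of items.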


In [6]:
#importing Pandas package
import pandas as pd

#dataframe for the above data
fullCorpus = pd.DataFrame({
    'Label': labelList,
    'Text': textList
})

fullCorpus.head()

ValueError: arrays must all be same length

Building the dataframe raises an error saying the arrays must all be the same length. So, we're going to check the length of each of these lists to see where the issue lies.

In [7]:
#lengths of the label and text lists
print(len(labelList))
print(len(textList))

5571
5570


We can see that labelList has one more entry than textList. It may have picked up something at the very end of the data that creates the mismatch. So, let's print out the last five items of labelList.

In [8]:
#checking the last entries of label list
print(labelList[-5:])

['ham', 'ham', 'ham', 'ham', '']


From above, the very last entry of labelList is an empty string, an extra entry that we don't need. If we drop it, labelList will have the same length as textList, and the two will line up.

In [11]:
#creating dataframe by dropping the empty entry
fullCorpus = pd.DataFrame({ 
    'Text': textList,
    'Label': labelList[:-1]
})

fullCorpus.head(10)

| | Text | Label |
|---|---|---|
| 0 | I've been searching for the right words to tha... | ham |
| 1 | Free entry in 2 a wkly comp to win FA Cup fina... | spam |
| 2 | Nah I don't think he goes to usf, he lives aro... | ham |
| 3 | Even my brother is not like to speak with me. ... | ham |
| 4 | I HAVE A DATE ON SUNDAY WITH WILL!! | ham |
| 5 | As per your request 'Melle Melle (Oru Minnamin... | ham |
| 6 | WINNER!! As a valued network customer you have... | spam |
| 7 | Had your mobile 11 months or more? U R entitle... | spam |
| 8 | I'm gonna be home soon and i don't want to tal... | ham |
| 9 | SIX chances to win CASH! From 100 to 20,000 po... | spam |
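The empty final entry most likely appears because the raw text ends with a \n, so splitting on \n leaves an empty string after it. The sketch below reproduces this on a hypothetical two-message string (`sample` is made up for illustration, not taken from the file) and shows one possible alternative that avoids the extra entry. It is an illustration under those assumptions, not the approach this notebook uses.

```python
# Hypothetical two-message sample; it ends with a newline, as the real file
# is assumed to.
sample = "ham\tsee you at 5\nspam\tWIN a prize now\n"

# The replace-and-split approach: the final "\n" leaves an empty string at the end.
tokens = sample.replace("\t", "\n").split("\n")
print(tokens)  # ['ham', 'see you at 5', 'spam', 'WIN a prize now', '']

# Alternative: splitlines() does not emit a trailing empty string, and
# splitting each line once on "\t" keeps any later tabs inside the message.
rows = [line.split("\t", 1) for line in sample.splitlines() if line]
labels = [row[0] for row in rows]
texts = [row[1] for row in rows]

print(labels)  # ['ham', 'spam']
print(texts)   # ['see you at 5', 'WIN a prize now']
```

One caveat: `splitlines()` also breaks on line-boundary characters other than \n, so it is only a safe substitute if the messages do not contain such characters.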


The unstructured text at the beginning of this notebook has now been converted into a structured dataframe.

There is also a shortcut for the process above: because the data is tab-delimited, Pandas can read the file directly, as follows.

In [12]:
dataset = pd.read_csv("SMSSpamCollection.tsv", sep="\t", header=None)
dataset.head()

| | 0 | 1 |
|---|---|---|
| 0 | ham | I've been searching for the right words to tha... |
| 1 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 2 | ham | Nah I don't think he goes to usf, he lives aro... |
| 3 | ham | Even my brother is not like to speak with me. ... |
| 4 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! |
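With `header=None` the columns are simply numbered 0 and 1. As a hedged sketch, `read_csv` can also be given column names through its `names` parameter. The names `label` and `text` and the in-memory `sample` below are assumptions chosen for illustration. Passing `quoting=csv.QUOTE_NONE` is an optional extra that may be worth considering for free-form messages, since it treats quote characters as ordinary text.

```python
import csv
import io

import pandas as pd

# Hypothetical in-memory stand-in for SMSSpamCollection.tsv.
sample = io.StringIO('ham\tsee you at 5\nspam\tWIN "cash" now\n')

dataset = pd.read_csv(
    sample,
    sep="\t",
    header=None,
    names=["label", "text"],  # column names chosen here for illustration
    quoting=csv.QUOTE_NONE,   # treat quote characters as ordinary text
)

print(dataset.shape)  # (2, 2)
print(dataset.columns.tolist())  # ['label', 'text']
```

To apply the same call to the real data, replace `sample` with the path to the file.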
