# Reading in unstructured text data

## 1. READ IN TEXT DATA

### We read the text file with Python's built-in open() function. Reading the raw text ourselves keeps the process flexible enough to handle even messier files.

In [28]:
# Use a context manager so the file handle is closed after reading
with open("SMSSpamCollection.tsv") as f:
    rawData = f.read()
rawData[0:300]

"ham\tI've been searching for the right words to thank you for this breather. I promise i wont take your help for granted and will fulfil my promise. You have been wonderful and a blessing at all times.\nspam\tFree entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receiv"

### We can see that the values are separated by '\t', which means this is a tab-delimited text file. For TSV files, we can call the Pandas read_csv function and specify '\t' as the separator. First, though, we parse the raw text manually.

In [29]:
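# Turn every tab into a newline, then split on newlines, so labels and messages alternate in one flat list
parsedData = rawData.replace('\t', '\n').split('\n')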
parsedData[0:4]

['ham',
 "I've been searching for the right words to thank you for this breather. I promise i wont take your help for granted and will fulfil my promise. You have been wonderful and a blessing at all times.",
 'spam',
 "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"]

In [30]:
# Even positions hold the labels, odd positions hold the message bodies
labels = parsedData[0::2]
texts = parsedData[1::2]

print(labels[0:4])
print(texts[0:4])

['ham', 'spam', 'ham', 'ham']
["I've been searching for the right words to thank you for this breather. I promise i wont take your help for granted and will fulfil my promise. You have been wonderful and a blessing at all times.", "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's", "Nah I don't think he goes to usf, he lives around here though", 'Even my brother is not like to speak with me. They treat me like aids patent.']
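The replace-and-split approach above assumes that no message contains a tab or newline of its own. As a minimal sketch of a slightly more defensive alternative, the text can be parsed line by line, splitting each line on its first tab only. The `sample` string and the `parse_lines` helper are hypothetical stand-ins for illustration, not part of the original notebook.

```python
# Hypothetical in-memory sample standing in for the contents of the TSV file.
sample = "ham\tNah I don't think he goes to usf\nspam\tFree entry in 2 a wkly comp\n"


def parse_lines(raw):
    """Split tab-delimited text into parallel lists of labels and message bodies."""
    labels, texts = [], []
    for line in raw.split("\n"):
        if not line.strip():
            continue  # skip blank lines, including the one left by a trailing newline
        # Split on the first tab only, so any tabs inside the message are preserved.
        label, _, body = line.partition("\t")
        labels.append(label)
        texts.append(body)
    return labels, texts


sample_labels, sample_texts = parse_lines(sample)
print(sample_labels)
print(sample_texts)
```

Because blank lines are skipped, both lists come out the same length and no trailing element has to be trimmed afterwards.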


In [4]:
import pandas as pd

In [5]:
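# labels has one extra trailing element (most likely an empty string left by
# the file's final newline), so drop it to match the length of texts
dataFrame = pd.DataFrame({
    'label': labels[:-1],
    'body': texts
})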

dataFrame.head()

| | label | body |
|---|---|---|
| 0 | ham | I've been searching for the right words to tha... |
| 1 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 2 | ham | Nah I don't think he goes to usf, he lives aro... |
| 3 | ham | Even my brother is not like to speak with me. ... |
| 4 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! |


### The same result can be obtained directly with Pandas. Note that the manual procedure above is more flexible when the text file is more complex.

In [13]:
df = pd.read_csv("SMSSpamCollection.tsv", sep="\t", header=None)
df.columns = ["label", "body"]
df.head()

| | label | body |
|---|---|---|
| 0 | ham | I've been searching for the right words to tha... |
| 1 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 2 | ham | Nah I don't think he goes to usf, he lives aro... |
| 3 | ham | Even my brother is not like to speak with me. ... |
| 4 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! |
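As a small illustration of the read_csv call, the column names can also be passed through the `names` parameter instead of being assigned afterwards. The sketch below reads from a hypothetical two-row in-memory string via `io.StringIO` rather than from the real file, so it runs without the dataset; the sample rows are assumptions made for the example.

```python
import io

import pandas as pd

# Hypothetical two-row sample standing in for SMSSpamCollection.tsv.
sample = "ham\tNah I don't think he goes to usf\nspam\tFree entry in 2 a wkly comp\n"

sample_df = pd.read_csv(
    io.StringIO(sample),      # any file path or file-like object works here
    sep="\t",                 # tab-delimited input
    header=None,              # the data has no header row
    names=["label", "body"],  # assign column names while reading
)
print(sample_df.shape)
```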


## 2. ANALYSING THE TEXT DATA

In [14]:
# Number of rows and columns in the dataset
df.shape

(5568, 2)

In [25]:
# Number of ham entries
print("Ham: {}".format(len(df[df['label'] == 'ham'])))

# Number of spam entries
print("Spam: {}".format(len(df[df['label'] == 'spam'])))

Ham: 4822
Spam: 746
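The per-class counts can also be obtained in a single call with `Series.value_counts`. The sketch below uses a hypothetical four-row toy frame rather than the real dataset, so its numbers are illustrative only.

```python
import pandas as pd

# Hypothetical toy frame with the same column names as df.
toy = pd.DataFrame({
    "label": ["ham", "spam", "ham", "ham"],
    "body": ["a", "b", "c", "d"],
})

# Count of rows per label
counts = toy["label"].value_counts()

# Same counts expressed as fractions of the total
shares = toy["label"].value_counts(normalize=True)

print(counts["ham"], counts["spam"])
print(shares["ham"])
```

Applied to the real data, the same call would be `df["label"].value_counts()`, which reports every label present in the column, including any unexpected ones.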


In [33]:
# Checking for null values
df.isnull().sum()

label    0
body     0
dtype: int64