# File 04: Preprocessing and Splitting Original Dataset

This file is essential: it makes sure we have the data needed to train our model. Our source dataset contains 1.6 million labelled tweets, of which 20,000 were removed in 'File 01'. To avoid training the model on data we might predict on in later steps, we divide the entire dataset into 'main-data' and 'train-data'.

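We also preprocess all the tweets in this dataset.
The preprocessing involves:
- Removing HTML tags and converting to lowercase
- Removing @_usernames_
- Removing Hyperlinks
- Removing hashtag symbols, digits and punctuation (every non-letter character becomes a space, so the hashtag word itself is kept)
- Removing extra spaces
- Removing Stopwords (except 'not')
- Removing Single Characters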

### Input File:
- 1600k-noemoticon.csv

### Output Files:
- 04-main-data.csv -----> 20,000 Entries (10,000 Pos, 10,000 Neg)
- 04-train-data.csv ----> 1,580,000 Entries (790,000 Pos, 790,000 Neg)

### Steps:
1. load required libraries (standard and machine learning)
2. load and format the dataset
3. create functions that will preprocess the dataset
4. apply preprocessing on all tweets
5. convert the results into a dataframe
6. split the dataframe into 'main' and 'train'
7. save 'main' and 'train' dataframes

In [1]:
# Loading all standard libraries
import re
import nltk
import numpy as np
import pandas as pd
from tqdm import tqdm
import matplotlib.pyplot as plt
nltk.download("stopwords")
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

In [4]:
# loading '1600k-noemoticon.csv' dataset
df = pd.read_csv("../db/1600k-noemoticon.csv", header=None)
df.isnull().values.any()
df.rename(
    columns = {
    0: 'SENTIMENT',
    1: 'ID',
    2: 'DATE',
    3: 'QUERY',
    4: 'USERNAME',
    5: 'TWEET'
    }, inplace=True, errors='raise'
)
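# keep only the columns used downstream; .copy() makes the selection an independent frame
df = df[['USERNAME', 'SENTIMENT', 'TWEET']].copy()
df['TEXT'] = ""  # placeholder column, filled in by the preprocessing step
dataset = df.values.tolist()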

In [39]:
df['TWEET'][0]

"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer.  You shoulda got David Carr of Third Day to do it. ;D"

In [41]:
# create functions that will preprocess the dataset
porter = PorterStemmer()
sw = stopwords.words('english')
sw.remove('not')

def remove_tags(text):
    TAG_RE = re.compile(r'<[^>]+>')
    return TAG_RE.sub('', text)

def remove_single_chars(text) :
    array = text.split()
    return (" ".join([w for w in array if len(w) > 1]))

def remove_stopwords(text) :
    text = " ".join([word for word in text.split() if word not in sw])
    return text

def preprocess_text(sen) :
    sentence = remove_tags(sen)
    sentence = sentence.lower()
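    # remove @usernames (letters, digits and underscores, any length)
    sentence = re.sub(r'@[A-Za-z0-9_]+', '', sentence)
    # remove hyperlinks
    sentence = re.sub(r"http\S+", "", sentence)
    # replace every non-letter (digits, punctuation, '#') with a space
    sentence = re.sub(r'[^a-zA-Z]', ' ', sentence)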
    sentence = remove_stopwords(sentence)
    sentence = remove_single_chars(sentence)
    return sentence

# preprocess_text() is applied to every tweet in the next cell
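To make the effect of each stage visible, here is a minimal sketch that traces the sample tweet shown earlier through the same sequence of steps. It is not the notebook's own code: `trace_preprocessing` and `STOPWORDS` are hypothetical names, the username pattern is a simplified one, and the seven-word stopword set is only a tiny stand-in for NLTK's English list.

```python
import re

# Tiny stand-in for NLTK's English stopword list (assumption: the notebook
# uses the full NLTK list with 'not' removed).
STOPWORDS = {"that", "a", "you", "of", "to", "do", "it"}


def trace_preprocessing(tweet):
    """Return (stage name, text) pairs, one per cleaning stage."""
    stages = []
    text = re.sub(r"<[^>]+>", "", tweet).lower()
    stages.append(("tags + lowercase", text))
    text = re.sub(r"@[A-Za-z0-9_]+", "", text)
    stages.append(("usernames", text))
    text = re.sub(r"http\S+", "", text)
    stages.append(("hyperlinks", text))
    text = re.sub(r"[^a-z]", " ", text)
    stages.append(("non-letters", text))
    text = " ".join(w for w in text.split() if w not in STOPWORDS)
    stages.append(("stopwords", text))
    text = " ".join(w for w in text.split() if len(w) > 1)
    stages.append(("single characters", text))
    return stages


tweet = ("@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer.  "
         "You shoulda got David Carr of Third Day to do it. ;D")
stages = trace_preprocessing(tweet)
for name, text in stages:
    print(f"{name:>17}: {text!r}")
```

With this stand-in list the final stage yields `'awww bummer shoulda got david carr third day'`. The leftover `s` from "that's" and the `d` from ";D" survive the stopword stage here and are dropped by the single-character stage.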


In [43]:
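# apply preprocessing on all tweets
# each node is [USERNAME, SENTIMENT, TWEET, TEXT]
for node in tqdm(dataset):
    # collapse any label above 1 to 1, so SENTIMENT is binary (0 or 1)
    if node[1] > 1 :
        node[1] = 1
    node[3] = preprocess_text(node[2])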

100%|██████████████████████████████████████████████████████████████████| 1600000/1600000 [00:29<00:00, 53488.04it/s]


In [45]:
# create a dataframe
df = pd.DataFrame(dataset, columns=['USER', 'SENTIMENT', 'ORIGINAL', 'TEXT'])
# df = df[['USER', 'TEXT', 'SENTIMENT']]

In [47]:
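# split the dataframe
# the slices rely on the file order: rows 0-799,999 have sentiment 0, rows 800,000 onwards have sentiment 1
df_list = df.values.tolist()
# 'main': the first 10,000 rows of each class
df_main = df_list[:10000] + df_list[800000:810000]
# 'train': everything else
df_train = df_list[10000:800000] + df_list[810000:]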
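The slicing pattern is easier to check on a small scale. The sketch below applies the same idea to a toy list of 16 rows, sorted with the negative class first, and holds out 2 rows per class. The names `half`, `held_out` and `class_counts` are illustrative and not part of the notebook, where the corresponding values are 800,000 and 10,000.

```python
# Toy stand-in for the real dataset: rows are [user, sentiment, tweet, text],
# sorted so that all negatives (0) come before all positives (1).
half, held_out = 8, 2
rows = [[f"user{i}", 0 if i < half else 1, f"tweet {i}", ""]
        for i in range(2 * half)]

# Hold out the first `held_out` rows of each class, keep the rest for training.
main = rows[:held_out] + rows[half:half + held_out]
train = rows[held_out:half] + rows[half + held_out:]


def class_counts(data):
    """Count rows per sentiment label."""
    counts = {0: 0, 1: 0}
    for row in data:
        counts[row[1]] += 1
    return counts


print(class_counts(main))   # {0: 2, 1: 2}
print(class_counts(train))  # {0: 6, 1: 6}

# No row may end up in both subsets.
assert not {row[0] for row in main} & {row[0] for row in train}
```

Both subsets stay balanced only because the input is sorted by class. On a shuffled file the same slices would produce arbitrary class proportions.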

In [49]:
# build the final dataframes
main = pd.DataFrame(df_main, columns=['USER', 'SENTIMENT', 'ORIGINAL', 'TEXT'])
train = pd.DataFrame(df_train, columns=['USER', 'SENTIMENT', 'ORIGINAL', 'TEXT'])

In [52]:
main.SENTIMENT.value_counts()

0    10000
1    10000
Name: SENTIMENT, dtype: int64

In [53]:
train.SENTIMENT.value_counts()

0    790000
1    790000
Name: SENTIMENT, dtype: int64

In [54]:
# saving final dataframes
main.to_csv('../db/04-main-data.csv', index=False)
train.to_csv('../db/04-train-data.csv', index=False)

In [55]:
len(train)

1580000
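One thing to keep in mind when these CSV files are loaded again: a tweet that preprocessing reduces to an empty string (for example one made up only of a mention and a link) is written as an empty field, and `pd.read_csv` parses empty fields as `NaN` by default. The sketch below shows this on a toy frame with an in-memory buffer. The frame contents are made up for illustration, and whether a later file should keep, fill or drop such rows is left open here.

```python
import io

import pandas as pd

# Toy frame: the second tweet was reduced to an empty string by preprocessing.
frame = pd.DataFrame({"SENTIMENT": [0, 1], "TEXT": ["awww bummer", ""]})

buffer = io.StringIO()
frame.to_csv(buffer, index=False)

# Default parsing turns the empty field into NaN.
buffer.seek(0)
default_read = pd.read_csv(buffer)
print(default_read["TEXT"].isna().tolist())  # [False, True]

# keep_default_na=False keeps it as an empty string instead.
buffer.seek(0)
text_read = pd.read_csv(buffer, keep_default_na=False)
print(text_read["TEXT"].tolist())  # ['awww bummer', '']
```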