# Data Cleaning
In this notebook, I look at the structure of the data and clean it as necessary. I also run the preprocessing for the text data here.

In [46]:
import pandas as pd
import numpy as np

import nltk
import re

In [2]:
ls DATA

judge-1377884607_tweet_product_company.csv


In [7]:
df = pd.read_csv('DATA/judge-1377884607_tweet_product_company.csv', engine = 'python')
df.head(3)

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion


### Missing values

In [8]:
df.isnull().sum()

tweet_text                                               1
emotion_in_tweet_is_directed_at                       5802
is_there_an_emotion_directed_at_a_brand_or_product       0
dtype: int64

A missing tweet cannot be inferred from the other columns, so I'll drop the one row where tweet_text is missing.

In [12]:
df = df.dropna(subset = ['tweet_text'])
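As a quick illustration of what `dropna(subset=...)` does here, the sketch below uses a small made-up frame (the rows are invented, not taken from the dataset). Only rows with a missing `tweet_text` are dropped; missing values in the other column are left alone.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are made up for illustration).
toy = pd.DataFrame({
    'tweet_text': ['great phone', np.nan, 'battery died'],
    'emotion_in_tweet_is_directed_at': ['iPhone', np.nan, np.nan],
})

# Drop rows only when tweet_text is missing; NaNs in other columns do not trigger a drop.
cleaned = toy.dropna(subset=['tweet_text'])

print(len(toy), '->', len(cleaned))
print(cleaned.isnull().sum())
```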

Whether the tweet is directed at Apple or Google is a post-tweet variable, so it isn't relevant for this problem. I'll remove this column entirely.

In [15]:
df = df.drop('emotion_in_tweet_is_directed_at', axis = 1)

### Rename Columns
I'll rename the columns to make them easier to reference.

In [17]:
df.columns = ['tweet', 'sentiment']

### Abnormal sentiment
Check whether there are any abnormal sentiment labels.

In [20]:
df.sentiment.value_counts()

No emotion toward brand or product    5388
Positive emotion                      2978
Negative emotion                       570
I can't tell                           156
Name: sentiment, dtype: int64

"I can't tell" is not a valid class, so I'll remove these rows.

In [None]:
df = df[df.sentiment != "I can't tell"]
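The value counts further down still list "I can't tell", so it may be worth confirming that this filter actually ran. Below is a minimal sketch of the filter plus a sanity check, on a made-up frame (the rows are invented for illustration).

```python
import pandas as pd

# Toy frame with the same column names as above (rows are made up).
toy = pd.DataFrame({
    'tweet': ['a', 'b', 'c', 'd'],
    'sentiment': ['Positive emotion', "I can't tell",
                  'Negative emotion', "I can't tell"],
})

# Keep every row whose label is not "I can't tell".
filtered = toy[toy['sentiment'] != "I can't tell"]

# Sanity check: the dropped label should no longer appear.
remaining = set(filtered['sentiment'].unique())
assert "I can't tell" not in remaining

print(filtered['sentiment'].value_counts())
```

Running the same `assert` against `df` right after the filter would flag a skipped cell early.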

### No original tweet
If a tweet contains only retweeted text (RT) and no original text, we would be judging the previous tweet rather than this one, so we should remove these.

In [142]:
# Remove the retweeted part: "RT" and everything after it on the same line
def remove_RT(str_):
    return re.sub(r'RT.+', '', str_)

In [147]:
df['tweet'] = df['tweet'].map(remove_RT)

In [155]:
# drop tweets that are empty after removing the retweeted part
df = df[df['tweet'] != '']
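One thing to watch with the pattern in `remove_RT`: it matches the letters "RT" anywhere, including inside an uppercase word, and text left as whitespace is not caught by the `!= ''` filter. The sketch below is a possible stricter variant, not what was run above. The name `remove_rt_strict` and the example tweets are hypothetical.

```python
import re
import pandas as pd

# Match "RT" only as a standalone token, then everything after it.
RT_PATTERN = re.compile(r'\bRT\b.*', flags=re.DOTALL)

def remove_rt_strict(text):
    """Drop a standalone 'RT' token and everything after it, then trim whitespace."""
    return RT_PATTERN.sub('', text).strip()

# Made-up examples for illustration.
toy = pd.Series([
    'RT @mention Great talk at #SXSW',             # retweet only
    'Love this! RT @mention iPad 2 line is huge',  # comment + retweet
    'SMART move by Google',                        # "RT" inside a word
])

cleaned = toy.map(remove_rt_strict)
cleaned = cleaned[cleaned != '']  # drop tweets with no original text left

print(cleaned.tolist())
```

With the original pattern, the third example would be cut down to `'SMA'`. Switching patterns would change the row counts shown below, so this is only a suggestion.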

In [156]:
df.sentiment.value_counts()

No emotion toward brand or product    4133
Positive emotion                      2406
Negative emotion                       474
I can't tell                           134
Name: sentiment, dtype: int64

# Train/Test Split
I'll split the data into train and test sets here so that no information leaks from the test set during preprocessing.

In [157]:
X = df['tweet']
y = df['sentiment']

In [158]:
from sklearn.model_selection import train_test_split

In [159]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .2)

I'll also hold out a validation set from the training set.

In [160]:
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size = .2)
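The two splits above give roughly 64% train, 16% validation and 20% test. Because the classes are imbalanced and no seed is set, a reproducible, stratified variant might be worth considering. The sketch below uses the `stratify` and `random_state` parameters of `train_test_split`. The toy data and the seed value `42` are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced data standing in for the real tweets (made up for illustration).
toy = pd.DataFrame({
    'tweet': [f'tweet {i}' for i in range(100)],
    'sentiment': ['No emotion'] * 60 + ['Positive'] * 30 + ['Negative'] * 10,
})
X, y = toy['tweet'], toy['sentiment']

SEED = 42  # arbitrary; any fixed integer makes the split repeatable

# First split: hold out 20% as the test set, keeping class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=SEED)

# Second split: hold out 20% of the remaining training data for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.2, stratify=y_train, random_state=SEED)

print(len(X_train), len(X_val), len(X_test))
print(y_test.value_counts())
```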

In [161]:
# save each split to disk
X_train.to_csv('DATA/X_train.csv')
y_train.to_csv('DATA/y_train.csv')
X_test.to_csv('DATA/X_test.csv')
y_test.to_csv('DATA/y_test.csv')
X_val.to_csv('DATA/X_val.csv')
y_val.to_csv('DATA/y_val.csv')
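Since `to_csv` writes the index as the first column, reading these files back needs `index_col=0` to restore the original row labels. Here is a small round-trip sketch using a temporary directory and a made-up Series rather than the real files.

```python
import os
import tempfile
import pandas as pd

# Made-up labels with a non-default index, like a shuffled split would have.
y = pd.Series(['Positive emotion', 'Negative emotion'],
              index=[7, 3], name='sentiment')

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'y_train.csv')
    y.to_csv(path)
    # index_col=0 uses the first CSV column as the index instead of
    # loading it as an extra, unnamed data column.
    loaded = pd.read_csv(path, index_col=0)['sentiment']

print(loaded)
```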