<h2 align=center> Preprocessing </h2>

---

- [Convert Datetime](#convert-datetime)
- [Document Classification](#document-classification)
- [Text Preprocessing](#text-preprocessing)
    - [Tokenization](#tokenization)
    - [Remove special characters](#remove-special-characters)
    - [Removing Stopwords](#remove-stop-words)
    - [Stemming & Lemmatization](#stemming-lemming)
- [Text Preprocessing: A Function](#text-preprocessing-function)
- [Target Variable Numeric Encoding](#target-variable-numeric-encoding)



In [1]:
import pandas as pd
import nltk
from nltk.tokenize import TweetTokenizer
import string
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from num2words import num2words


In [2]:
df = pd.read_csv('/Users/rashidbaset/Code/cap_project/_data/raw-data/Tweets.csv')

### Converting Datetime

In [3]:
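# Raw timestamps look like "2015-02-22 12:04:07 -0800", so the format needs the UTC offset (%z).
# errors='coerce' marks unparseable values as NaT instead of silently skipping the conversion.
df['tweet_created'] = pd.to_datetime(df['tweet_created'], format="%Y-%m-%d %H:%M:%S %z", errors='coerce')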
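
The sample rows printed in this notebook carry timestamps such as `2015-02-22 12:04:07 -0800`, which include a UTC offset. Below is a minimal sketch, assuming that layout holds for the whole column, of a format string that matches it. The two-row `sample` Series is made up for illustration.

```python
import pandas as pd

# Hypothetical sample: one well-formed timestamp and one malformed value
sample = pd.Series(["2015-02-22 12:04:07 -0800", "not a date"])

# %z consumes the "-0800" UTC offset; errors="coerce" turns unparseable
# values into NaT so that parsing problems stay visible
parsed = pd.to_datetime(sample, format="%Y-%m-%d %H:%M:%S %z", errors="coerce")

first = parsed.iloc[0]
print(first.hour, first.utcoffset())
print(parsed.isna().tolist())
```

Counting the `NaT` values after conversion (`parsed.isna().sum()`) is a quick way to confirm that the format string really matched the data.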

### Initial Document Classification

The longer-term plan has two stages: first classify each tweet as neutral or non-neutral, then classify the sentiment of the tweets predicted to have polarity. To simplify the analysis here, we consider only positive and negative tweets and drop the neutral ones.

We also keep only the tweets that labelers classified with full confidence (`airline_sentiment_confidence == 1.0`).

In [4]:
df = df[df['airline_sentiment']!='neutral']
df = df[df['airline_sentiment_confidence']==1.0]

This narrowed our dataset to 8897 observations. 

In [5]:
len(df)

8897
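
The two filters can also be written as a single boolean mask. This is a small sketch on a made-up `toy` DataFrame that reuses the dataset's column names; it is not the notebook's own data.

```python
import pandas as pd

# Hypothetical rows using the same column names as the dataset
toy = pd.DataFrame({
    "airline_sentiment": ["negative", "neutral", "positive", "negative"],
    "airline_sentiment_confidence": [1.0, 1.0, 0.67, 1.0],
})

# Combine both conditions into one mask; each comparison needs its own
# parentheses because & binds more tightly than != and ==
mask = (toy["airline_sentiment"] != "neutral") & (toy["airline_sentiment_confidence"] == 1.0)
filtered = toy[mask]

print(len(filtered))
```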

### Text Pre-Processing 

For text analysis, we need to convert the raw tweet text into a consistent form that is suitable for modeling.

We followed four steps:

1. Tokenization
2. Removing special characters
3. Removing stopwords
4. Stemming

#### Converting texts into tokens.

To see what happens under the hood during tokenization, we pick a few tweets that we're interested in comparing. This list can be used to compare how different tokenizers behave.

In [17]:
df[df['airline_sentiment'] == 'negative'].tail(5)

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
14631,569588464896876545,negative,1.0,Bad Flight,1.0,American,,MDDavis7,,0,@AmericanAir thx for nothing on getting us out...,,2015-02-22 12:04:07 -0800,US,Eastern Time (US & Canada)
14633,569587705937600512,negative,1.0,Cancelled Flight,1.0,American,,RussellsWriting,,0,@AmericanAir my flight was Cancelled Flightled...,,2015-02-22 12:01:06 -0800,Los Angeles,Arizona
14634,569587691626622976,negative,0.6684,Late Flight,0.6684,American,,GolfWithWoody,,0,@AmericanAir right on cue with the delays👌,,2015-02-22 12:01:02 -0800,,Quito
14636,569587371693355008,negative,1.0,Customer Service Issue,1.0,American,,itsropes,,0,@AmericanAir leaving over 20 minutes Late Flig...,,2015-02-22 11:59:46 -0800,Texas,
14638,569587188687634433,negative,1.0,Customer Service Issue,0.6659,American,,SraJackson,,0,"@AmericanAir you have my money, you change my ...",,2015-02-22 11:59:02 -0800,New Jersey,Eastern Time (US & Canada)


In [6]:
compare_list = ['@united stuck here in IAH waiting on flight 253 to Honolulu for 7 hours due to maintenance issues. Could we have gotten a new plane!?!? Fail',
               '@JetBlue had a great flight to Orlando from Hartford a few weeks ago! Was great to get out on time and arrive early!',
               '@AmericanAir my flight was Cancelled Flightled, leaving tomorrow morning. Auto rebooked for a Tuesday night flight but need to arrive Monday.']

In [7]:
from nltk.tokenize import TweetTokenizer

# Pick one tweet by its index label and tokenize it
tweet = df.loc[8644, 'text']
Tokenizer = TweetTokenizer()
tokenized = Tokenizer.tokenize(tweet)

print('Original:')
print(tweet)
print('\nTokenized:')
print(tokenized)

Original:
@JetBlue had a great flight to Orlando from Hartford a few weeks ago! Was great to get out on time and arrive early!

Tokenized:
['@JetBlue', 'had', 'a', 'great', 'flight', 'to', 'Orlando', 'from', 'Hartford', 'a', 'few', 'weeks', 'ago', '!', 'Was', 'great', 'to', 'get', 'out', 'on', 'time', 'and', 'arrive', 'early', '!']


`TweetTokenizer` is built for tweets: as the output shows, it keeps the mention `@JetBlue` as a single token and splits each `!` into a token of its own.
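
To make the comparison concrete, here is a short sketch that sets plain whitespace splitting against `TweetTokenizer` on a shortened version of the tweet above. The `preserve_case` and `strip_handles` flags are optional constructor arguments that the notebook itself does not use.

```python
from nltk.tokenize import TweetTokenizer

tweet = "@JetBlue had a great flight!"

# Plain whitespace splitting leaves the '!' glued to the last word
whitespace_tokens = tweet.split()

# The default TweetTokenizer keeps the @mention whole and splits off '!'
default_tokens = TweetTokenizer().tokenize(tweet)

# Optional flags: lowercase the text and drop @mentions entirely
stripped_tokens = TweetTokenizer(preserve_case=False, strip_handles=True).tokenize(tweet)

print(whitespace_tokens)
print(default_tokens)
print(stripped_tokens)
```

Whether to strip handles is a modeling choice: the airline handle appears in almost every tweet, so it may add little signal beyond the `airline` column.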

#### Punctuation

Next we remove punctuation and convert all tokens to lowercase. The exclamation mark may be informative about sentiment, so we keep it as a token.

In [8]:
import string

# Drop single-character punctuation tokens, but keep '!' as a possible sentiment signal
punctuation = list(string.punctuation)
punctuation.remove('!')
tokenized_no_punctuation = [word.lower() for word in tokenized if word not in punctuation]
print(tokenized_no_punctuation)

['@jetblue', 'had', 'a', 'great', 'flight', 'to', 'orlando', 'from', 'hartford', 'a', 'few', 'weeks', 'ago', '!', 'was', 'great', 'to', 'get', 'out', 'on', 'time', 'and', 'arrive', 'early', '!']
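
A membership test against `string.punctuation` only catches tokens that are a single punctuation character. The sketch below, using a made-up token list and a hypothetical helper `is_punctuation`, shows one way to also drop multi-character punctuation tokens such as `...` while still keeping `!`.

```python
import string

# Keep '!' because it may carry sentiment; a frozenset gives fast lookups
PUNCTUATION = frozenset(string.punctuation) - {"!"}

def is_punctuation(token):
    """True if every character of the token is punctuation other than '!'."""
    return all(ch in PUNCTUATION for ch in token)

# Hypothetical token list; '...' is a multi-character token that a plain
# `token not in PUNCTUATION` test would let through
tokens = ["Great", "flight", "...", ",", "thanks", "!"]
cleaned = [tok.lower() for tok in tokens if not is_punctuation(tok)]
print(cleaned)
```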


#### Removing stopwords.

In [10]:
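from nltk.corpus import stopwords

# Build the stopword set once instead of reloading the list for every token
stop_words = set(stopwords.words('english'))
tokenized_no_stopwords = [word for word in tokenized_no_punctuation if word not in stop_words]
print(tokenized_no_stopwords)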

['@jetblue', 'great', 'flight', 'orlando', 'hartford', 'weeks', 'ago', '!', 'great', 'get', 'time', 'arrive', 'early', '!']
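
NLTK's English stopword list includes negations such as `not` and `no`, which can flip the sentiment of a tweet. The following sketch shows how selected words could be kept. The small `stop_words` set is a hand-picked stand-in for the NLTK list, and the choice of words to keep is an assumption, not part of the original pipeline.

```python
# Stand-in for set(stopwords.words('english')): a small hand-picked subset
# so the example runs without downloading the NLTK corpus
stop_words = {"a", "the", "was", "to", "on", "not", "no"}

# Negations that may matter for sentiment (an assumption for illustration)
keep = {"not", "no"}
sentiment_stop_words = stop_words - keep

tokens = ["the", "flight", "was", "not", "on", "time", "!"]
filtered = [tok for tok in tokens if tok not in sentiment_stop_words]
print(filtered)
```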


#### Stemming

We use the `PorterStemmer` from the NLTK package. Note that it performs stemming only: it strips suffixes by rule and does not lemmatize.

In [11]:
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()  # create the stemmer once and reuse it for every token
tokens = [stemmer.stem(word) for word in tokenized_no_stopwords]
print(tokens)

['@jetblu', 'great', 'flight', 'orlando', 'hartford', 'week', 'ago', '!', 'great', 'get', 'time', 'arriv', 'earli', '!']
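
As the output shows, stems are not always dictionary words. A short sketch on a few of the words from this tweet makes the behavior easier to see in isolation:

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()  # one instance can be reused for every token

words = ["arrive", "early", "weeks", "great"]
stems = [stemmer.stem(word) for word in words]
print(stems)
```

This is fine for bag-of-words features, where all that matters is that related word forms map to the same string.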


### Bringing it all together: a function that turns each tweet into a list of tokens

In [12]:
from num2words import num2words

def tweet_preprocessor(text):
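    tokenized = Tokenizer.tokenize(text)
    # Remove punctuation tokens (except '!') and lowercase the rest
    punctuation = list(string.punctuation)
    punctuation.remove('!')
    tokenized_no_punctuation = [word.lower() for word in tokenized if word not in punctuation]
    # Remove English stopwords; the set is built once per call, not once per token
    stop_words = set(stopwords.words('english'))
    tokenized_no_stopwords = [word for word in tokenized_no_punctuation if word not in stop_words]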
    tokens = [PorterStemmer().stem(word) for word in tokenized_no_stopwords if word != '️']
    for i in range(len(tokens)):
        try:
            # Spell out numeric tokens (e.g. '30' -> 'thirty'); non-numeric tokens raise and are left as-is
            tokens[i] = num2words(tokens[i])
        except Exception:
            pass
    return tokens

# Apply tweet_preprocessor to each element of the 'text' column
df['tokens']=df['text'].apply(tweet_preprocessor)  
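
One possible restructuring, sketched here as a suggestion rather than as the notebook's method, is to build the expensive pieces once and pass them in. `make_preprocessor` is a hypothetical helper, and the demo uses simple stand-ins so that it runs without NLTK data or `num2words`. It also swaps the `try`/`except` for an explicit digit check, which only converts tokens made entirely of decimal digits.

```python
import string

# Punctuation to drop; '!' is kept as a possible sentiment signal
PUNCTUATION = frozenset(string.punctuation) - {"!"}

def make_preprocessor(tokenize, stop_words, stem, number_to_words):
    """Build a preprocessing function from its four ingredients (hypothetical helper)."""
    stop_words = frozenset(stop_words)  # built once, reused for every tweet

    def preprocess(text):
        tokens = [tok.lower() for tok in tokenize(text) if tok not in PUNCTUATION]
        tokens = [stem(tok) for tok in tokens if tok not in stop_words]
        # Spell out tokens made only of decimal digits; leave the rest untouched
        return [number_to_words(int(tok)) if tok.isdecimal() else tok for tok in tokens]

    return preprocess

# Stand-ins for the tokenizer, stopword list, stemmer, and number converter
demo = make_preprocessor(
    tokenize=str.split,
    stop_words={"a", "would"},
    stem=lambda word: word,
    number_to_words=lambda n: {30: "thirty"}.get(n, str(n)),
)
result = demo("Would pay 30 a flight !")
print(result)
```

In this notebook's setting the four arguments would be `TweetTokenizer().tokenize`, `stopwords.words('english')`, `PorterStemmer().stem`, and `num2words`.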

#### Taking a look at the results

In [13]:
df[['text','tokens']].head(10)

Unnamed: 0,text,tokens
3,@VirginAmerica it's really aggressive to blast...,"[@virginamerica, realli, aggress, blast, obnox..."
4,@VirginAmerica and it's a really big bad thing...,"[@virginamerica, realli, big, bad, thing]"
5,@VirginAmerica seriously would pay $30 a fligh...,"[@virginamerica, serious, would, pay, thirty, ..."
9,"@VirginAmerica it was amazing, and arrived an ...","[@virginamerica, amaz, arriv, hour, earli, good]"
11,@VirginAmerica I &lt;3 pretty graphics. so muc...,"[@virginamerica, <3, pretti, graphic, much, be..."
12,@VirginAmerica This is such a great deal! Alre...,"[@virginamerica, great, deal, !, alreadi, thin..."
14,@VirginAmerica Thanks!,"[@virginamerica, thank, !]"
16,@VirginAmerica So excited for my first cross c...,"[@virginamerica, excit, first, cross, countri,..."
17,@VirginAmerica I flew from NYC to SFO last we...,"[@virginamerica, flew, nyc, sfo, last, week, f..."
18,I ❤️ flying @VirginAmerica. ☺️👍,"[❤, fli, @virginamerica, ☺, 👍]"


### Encoding target variable numerically

In [14]:
df['positive']=(df['airline_sentiment']=='positive').astype(int)
df.head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone,tokens,positive
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada),"[@virginamerica, realli, aggress, blast, obnox...",0
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada),"[@virginamerica, realli, big, bad, thing]",0
5,570300767074181121,negative,1.0,Can't Tell,0.6842,Virgin America,,jnardino,,0,@VirginAmerica seriously would pay $30 a fligh...,,2015-02-24 11:14:33 -0800,,Pacific Time (US & Canada),"[@virginamerica, serious, would, pay, thirty, ...",0
9,570295459631263746,positive,1.0,,,Virgin America,,YupitsTate,,0,"@VirginAmerica it was amazing, and arrived an ...",,2015-02-24 10:53:27 -0800,Los Angeles,Eastern Time (US & Canada),"[@virginamerica, amaz, arriv, hour, earli, good]",1
11,570289724453216256,positive,1.0,,,Virgin America,,HyperCamiLax,,0,@VirginAmerica I &lt;3 pretty graphics. so muc...,,2015-02-24 10:30:40 -0800,NYC,America/New_York,"[@virginamerica, <3, pretti, graphic, much, be...",1


#### Saving work and only keeping columns which we will use later

In [15]:
df = df[['airline_sentiment', 'airline', 'tokens', 'positive', 'text', 'negativereason']]
# df is already a DataFrame, so it can be written directly
df.to_csv('/Users/rashidbaset/Code/cap_project/_data/processed/text_processed.csv', index=False)
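
One thing to keep in mind when reloading this file: CSV has no list type, so each entry of `tokens` comes back as a string. The sketch below, using a made-up `toy` DataFrame and an in-memory buffer in place of the real file, shows one way to restore the lists with `ast.literal_eval`.

```python
import ast
import io

import pandas as pd

toy = pd.DataFrame({"tokens": [["great", "flight", "!"], ["thank", "!"]]})

# Write to an in-memory buffer instead of a file to keep the example self-contained
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# Each list comes back as its string representation
loaded = pd.read_csv(buffer)
print(type(loaded.loc[0, "tokens"]))

# ast.literal_eval safely parses that string back into a Python list
loaded["tokens"] = loaded["tokens"].apply(ast.literal_eval)
print(loaded.loc[0, "tokens"])
```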