Apply CNN on word embeddings of twitter data to predict sentiment of tweets.

samshipengs/Twitter-Sentiment-Prediction

Twitter Sentiment Analysis

Classify the sentiment of tweets as positive, negative, or neutral by applying a convolutional neural network (CNN) to Word2Vec vector representations of the words. US airline data is used for the demonstration.

Data


Airline Twitter sentiment

The Twitter data was scraped from February 2015. Contributors were asked to first classify each tweet as positive, negative, or neutral, and then to categorize the reasons for the negative ones (such as “late flight” or “rude service”).

Prerequisites


  • lasagne - Create conv-net
  • nltk - Data pre-processing
  • sklearn - Provide useful tools e.g. stratified cross-validation

Getting Started


Begin by creating a directory, e.g. twitter_sentiment, for storing the training data, the Word2Vec model, and the CNN model, and set FILE_PATH to this directory.

  • data: contains the training data (airline data in this case) and the test data.
  • wordvec: the word embedding model is saved here.
  • model: the CNN model is saved here.
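
The layout above can be created in one step. This is a minimal sketch, assuming the three names in the list are sub-directories of the directory that FILE_PATH points to; the helper `make_layout` is hypothetical and not part of the repository, and the notebooks may expect FILE_PATH as a plain string rather than a `Path`.

```python
from pathlib import Path

# Hypothetical location; point FILE_PATH at wherever you keep the project data.
FILE_PATH = Path("twitter_sentiment")

# Sub-directory names taken from the list above.
SUBDIRS = ("data", "wordvec", "model")


def make_layout(root):
    """Create root and its sub-directories, returning the sub-directory paths."""
    root = Path(root)
    paths = [root / name for name in SUBDIRS]
    for path in paths:
        # parents=True also creates root; exist_ok=True makes reruns harmless.
        path.mkdir(parents=True, exist_ok=True)
    return paths
```

Calling `make_layout(FILE_PATH)` once before opening the notebooks should be enough; the training CSV then goes into the data sub-directory.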

Run


Train the CNN using the model_airline notebook:

jupyter notebook model_airline.ipynb

Training can take some time. When it finishes, a cnn.npz file is created.

Then make predictions on Twitter data:

jupyter notebook predictions.ipynb

Example


airline_data = Data('Airline-Sentiment-2-w-AA.csv', FILE_PATH)
airline_df = airline_data.csv_df(['airline_sentiment', 'text']) # load data
airline_data.pre_process(airline_df) # pre-process data
airline_df.head()
   airline_sentiment  text                                              tokenized
0  neutral            What said                                         [said]
1  positive           plus youve added commercials to the experienc...  [plus, youve, added, commercials, experience, ...
2  neutral            I didnt today Must mean I need to take anothe...  [didnt, today, must, mean, need, take, another...
3  negative           its really aggressive to blast obnoxious ente...  [really, aggressive, blast, obnoxious, enterta...
4  negative           and its a really big bad thing about it           [really, big, bad, thing]
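
The tokenized column suggests that pre-processing lowercases the text, strips mentions and punctuation, and drops stop words. The sketch below reproduces that behaviour for two of the rows using only the standard library. It is an illustration, not the repository's `pre_process` method: the function `tokenize` and the tiny `STOP_WORDS` set are hypothetical, and the project lists nltk for this step.

```python
import re

# Tiny illustrative stop-word set (an assumption); a real list is much larger.
STOP_WORDS = {"a", "about", "and", "i", "it", "its", "the", "to", "what"}


def tokenize(text):
    """Lowercase, strip mentions, links and punctuation, then drop stop words."""
    text = text.lower()
    text = re.sub(r"@\w+|https?://\S+", " ", text)  # remove mentions and links
    text = re.sub(r"[^a-z\s]", "", text)            # remove digits and punctuation
    return [tok for tok in text.split() if tok not in STOP_WORDS]


# Made-up handles; the wording mirrors rows 0 and 4 of the table above.
print(tokenize("@SomeAirline What @someone said."))          # ['said']
print(tokenize("and it's a really big bad thing about it"))  # ['really', 'big', 'bad', 'thing']
```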

The model is then trained:

train freq [1890 2479 2479]
val freq [473 620 620]
Extracting ... 
Extracting ... 
Training cv 1 ...
[LibSVM]0.880910683012


train freq [1890 2479 2479]
val freq [473 620 620]
Extracting ... 
Extracting ... 
Training cv 2 ...
[LibSVM]0.879743140689


train freq [1890 2479 2479]
val freq [473 620 620]
Extracting ... 
Extracting ... 
Training cv 3 ...
[LibSVM]0.877408056042
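
The class frequencies in the log are consistent with a stratified 80/20 split, where each class keeps the same share in the training and validation sets. The project lists sklearn for this; the pure-Python sketch below only illustrates the idea. The function `stratified_split` is hypothetical, and which integer label corresponds to which sentiment is an assumption made for the example.

```python
import random
from collections import Counter


def stratified_split(labels, val_fraction=0.2, seed=0):
    """Return (train_idx, val_idx) with per-class proportions preserved."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * val_fraction)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return train_idx, val_idx


# Class sizes implied by the log above (train + val counts per class).
labels = [0] * 2363 + [1] * 3099 + [2] * 3099
train_idx, val_idx = stratified_split(labels)
train_freq = Counter(labels[i] for i in train_idx)
val_freq = Counter(labels[i] for i in val_idx)
print([train_freq[c] for c in range(3)])  # [1890, 2479, 2479]
print([val_freq[c] for c in range(3)])    # [473, 620, 620]
```

Changing `seed` gives a different shuffle with the same class counts, which is one way to obtain repeated folds like the three shown in the log.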

The trained model is then used to predict the sentiment of collected airline tweets.

Group the tweets by the sentiment that the CNN model assigns, then find the most frequent words in each group:

ALL = Prediction(FILE_PATH, 'FOUR_AIRLINES.csv', max_len_train=19)
ALL.prepare_data(['text', 'airline'], wv_size=600)
ALL.get_result(n_preview=10, n_top=20, name='ALL_result', verbose=False)
===Positive===
[('thanks', 170), ('thank', 139), ('great', 136), ('flight', 120), ('service', 65), 
 ('love', 48), ('fly', 44), ('crew', 39), ('leggings', 38), ('best', 38), 
 ('flying', 35), ('much', 34), ('night', 34), ('good', 34), ('always', 32),
 ('us', 31), ('home', 31), ('time', 30), ('last', 30), ('got', 30)]

===Negative===
[('flight', 400), ('get', 155), ('stop', 144), ('tickets', 137), ('time', 128),
 ('seaworld', 127), ('selling', 123), ('via', 122), ('urge', 121), ('service', 108),
 ('customer', 100), ('still', 95), ('one', 92), ('delayed', 91), ('flights', 83), 
 ('us', 78), ('bag', 75), ('flying', 74), ('hours', 72), ('hour', 70)]
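
The per-sentiment word counts above can be computed by grouping tokenized tweets by predicted label and counting words within each group. The following is a minimal sketch of that idea, not the code behind `get_result`; the function `top_words` and the toy data are invented for illustration.

```python
from collections import Counter


def top_words(tokenized_tweets, predicted, n_top=20):
    """Map each predicted label to its n_top most common (word, count) pairs."""
    counters = {}
    for tokens, label in zip(tokenized_tweets, predicted):
        counters.setdefault(label, Counter()).update(tokens)
    return {label: c.most_common(n_top) for label, c in counters.items()}


# Toy data, invented for illustration only.
tweets = [
    ["thanks", "great", "flight"],
    ["flight", "delayed"],
    ["flight", "delayed", "hours"],
    ["flight", "bag"],
]
labels = ["positive", "negative", "negative", "negative"]
result = top_words(tweets, labels, n_top=2)
print(result["negative"])  # [('flight', 3), ('delayed', 2)]
```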

Look at the context of some of the most frequent words in the tweets grouped as negative:

ALL.check(word='flight', sentiment=3, n_view=10)
"@AmericanAir with that totally random flight cancellation <ed><U+00A0><U+00BD><ed><U+00B1><U+0080>"
"Baffled by @AmericanAir boarding  passengers with full knowledge that the captain is still on an inbound  flight. SMH. #FAIL"
"#TFW you finally get off your @AmericanAir flight that taxied for over an hour. <ed><U+00A0><U+00BD><ed><U+00B9><U+008C><ed><U+00A0><U+00BC><ed><U+00BF><U+00BE><ed><U+00A0><U+00BC><ed><U+00BE><U+0089><ed><U+00A0><U+00BC><ed><U+00BE><U+0089><ed><U+00A0><U+00BD><ed><U+00B1><U+008F><ed><U+00A0><U+00BC><ed><U+00BF><U+00BD> https://t.co/TyAlTpAWFC"
"@AmericanAir unfortunately made the mistake of booking @united, who put me on an @aircanada flight. Next time! <ed><U+00A0><U+00BD><ed><U+00B8><U+0093>"
"@AmericanAir the flight attend made me stow iPad for landing b/c it "has a keypad". Travel weekly. Never had to stow. Is this new? #AA1164"
"@AmericanAir I have a question about my seats on upcoming flight."
"@AmericanAir I joined Twitter to tell people how bad you are! Second time missing a connecting flight in LA but somehow isn't your fault!"
"RT @RandyStillinger: World War II veterans get a hero's welcome by the #AATeam upon arrival on @AmericanAir #SoaringValor charter flight wi�"
"At LAS with a colleague waiting for our golf clubs because they missed our flight. Might not make our tee time. Unacceptable @AmericanAir"
"@AmericanAir I'm not Ralph ;) That's your awesome flight attendant."
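
A lookup like this amounts to filtering tweets by predicted label and by whether they contain the word. The sketch below shows one way to do that, assuming whole-word, case-insensitive matching; `tweets_with_word` is a hypothetical helper, not the repository's `check` method, and it uses string labels because the integer encoding behind `sentiment=3` is not documented here.

```python
import re


def tweets_with_word(tweets, predicted, word, sentiment, n_view=10):
    """Return up to n_view tweets with the given label that contain word."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    matches = [
        text
        for text, label in zip(tweets, predicted)
        if label == sentiment and pattern.search(text)
    ]
    return matches[:n_view]


# Toy data, invented for illustration only.
tweets = ["My flight was cancelled again.", "Great flight, thanks!", "Lost my bag."]
labels = ["negative", "positive", "negative"]
print(tweets_with_word(tweets, labels, "flight", "negative"))
# ['My flight was cancelled again.']
```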

This could help airlines improve the relevant services.

Reference


Convolutional Neural Networks for Sentence Classification
