<a href="https://colab.research.google.com/github/zac-prattson/airline-sentiment-analysis/blob/main/Airline_Sentiment_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Natural Language Processing
Code adapted from the PyCaret NLP tutorial.

This notebook analyzes the Twitter Airline Sentiment dataset from Kaggle, using topic models built with PyCaret.

In [2]:
# The pycaret.nlp module used below was removed in PyCaret 3.0, so stay on a 2.x release
!pip install "pycaret<3"



In [3]:
from pycaret.utils import enable_colab
enable_colab()

Colab mode enabled.


### Import the dataset
Use pandas to load the .csv file into a DataFrame, then set up the PyCaret environment.

Airline sentiment data from [Kaggle](https://www.kaggle.com/crowdflower/twitter-airline-sentiment/version/4)

In [4]:
import pandas as pd

In [5]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [14]:
tweet_dataset = pd.read_csv('/content/drive/My Drive/Colab Notebooks/data/Tweets.csv')
print(tweet_dataset)

                 tweet_id  ...               user_timezone
0      570306133677760513  ...  Eastern Time (US & Canada)
1      570301130888122368  ...  Pacific Time (US & Canada)
2      570301083672813571  ...  Central Time (US & Canada)
3      570301031407624196  ...  Pacific Time (US & Canada)
4      570300817074462722  ...  Pacific Time (US & Canada)
...                   ...  ...                         ...
14635  569587686496825344  ...                         NaN
14636  569587371693355008  ...                         NaN
14637  569587242672398336  ...                         NaN
14638  569587188687634433  ...  Eastern Time (US & Canada)
14639  569587140490866689  ...                         NaN

[14640 rows x 15 columns]


In [15]:
tweet_dataset = tweet_dataset.sample(3000, random_state=786).reset_index(drop=True)
tweet_dataset.shape

(3000, 15)
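
Fixing `random_state` is what makes this 3,000-row subsample repeatable from run to run. A minimal sketch of the same idea in plain Python follows. It uses the standard `random` module rather than pandas' sampler, so it will not select the same rows as the cell above, and `reproducible_sample` is a hypothetical helper name.

```python
import random

def reproducible_sample(row_ids, n, seed):
    """Draw n distinct items using a private, seeded generator."""
    rng = random.Random(seed)  # separate generator; global random state is untouched
    return rng.sample(row_ids, n)

row_ids = list(range(14640))  # one id per row of the full dataset

first = reproducible_sample(row_ids, 3000, seed=786)
second = reproducible_sample(row_ids, 3000, seed=786)

# The same seed yields the same rows in the same order
print(len(first), first == second)
```

Changing the seed gives a different subsample, which is a quick way to check whether later results depend on one particular draw.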

In [17]:
from pycaret.nlp import *
nlp = setup(data = tweet_dataset, target = 'text', session_id=123)

Description,Value
session_id,123
Documents,3000
Vocab Size,3017
Custom Stopwords,False


### Create Topic Model #1

Train an LDA topic model, then assign topic proportions to each tweet for analysis.

In [18]:
lda = create_model('lda')
print(lda)

LdaModel(num_terms=3017, num_topics=4, decay=0.5, chunksize=100)


In [19]:
lda_two = create_model('lda', num_topics = 6, multi_core = True)
print(lda_two)

LdaModel(num_terms=3017, num_topics=6, decay=0.5, chunksize=100)


In [20]:
lda_results = assign_model(lda)
lda_results.head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone,Topic_0,Topic_1,Topic_2,Topic_3,Dominant_Topic,Perc_Dominant_Topic
0,569174723877146625,negative,0.6725,Customer Service Issue,0.6725,Delta,,onlyfordisplay_,,0,hold,,2015-02-21 08:40:03 -0800,new york.,,0.153166,0.105969,0.513745,0.22712,Topic 2,0.51
1,569532793530785792,negative,1.0,Customer Service Issue,1.0,US Airways,,Harmony1412,,0,usairway get rebooke wake try call twice get h...,,2015-02-22 08:22:54 -0800,Wonderland,Indiana (East),0.090689,0.063183,0.458617,0.387512,Topic 2,0.46
2,569770363623575552,positive,1.0,,,Virgin America,,SamBrittenham,,0,team run tonight wait delay flight keep thing ...,,2015-02-23 00:06:55 -0800,USA,Eastern Time (US & Canada),0.101497,0.070473,0.678488,0.149541,Topic 2,0.68
3,568129940371214336,negative,1.0,Customer Service Issue,1.0,US Airways,,navydocdro,,0,usairway would kill give second bad muzak inst...,,2015-02-18 11:28:27 -0800,"okinawa, japan",,0.312269,0.066755,0.395534,0.225442,Topic 2,0.4
4,569003484701282304,negative,0.6955,Cancelled Flight,0.6955,Southwest,,ParkerBrown_,,0,southwestair get text flight help please,,2015-02-20 21:19:36 -0800,The Ohio State University,Quito,0.114474,0.079298,0.587548,0.21868,Topic 2,0.59
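
In the rows shown, `Dominant_Topic` and `Perc_Dominant_Topic` appear to be the index and rounded value of the largest `Topic_i` proportion. The sketch below reproduces that relationship for two of the rows; it is an illustration of the pattern in the output, not PyCaret's own code, and `dominant_topic` is a hypothetical helper name.

```python
def dominant_topic(proportions):
    """Return (label, share rounded to 2 decimals) for the largest proportion."""
    best = max(range(len(proportions)), key=lambda i: proportions[i])
    return f"Topic {best}", round(proportions[best], 2)

# Topic_0..Topic_3 values copied from rows 0 and 3 of the output above
row_0 = [0.153166, 0.105969, 0.513745, 0.22712]
row_3 = [0.312269, 0.066755, 0.395534, 0.225442]

print(dominant_topic(row_0))  # ('Topic 2', 0.51)
print(dominant_topic(row_3))  # ('Topic 2', 0.4)
```

All five rows displayed fall under Topic 2, so the topic distribution plot is worth checking to see how the remaining tweets are spread across topics.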


In [21]:
plot_model()

In [22]:
plot_model(plot = 'bigram')

In [23]:
plot_model(lda, plot = 'topic_distribution')

In [27]:
evaluate_model(lda)

In [28]:
save_model(lda, 'Airline Sentiment Model')

Model Succesfully Saved


(<gensim.models.ldamodel.LdaModel at 0x7f27d512db10>,
 'Airline Sentiment Model.pkl')

In [29]:
saved_lda = load_model('Airline Sentiment Model')

Model Sucessfully Loaded


In [30]:
print(saved_lda)

LdaModel(num_terms=3017, num_topics=4, decay=0.5, chunksize=100)


### Create Topic Model #2

Repeat the setup with custom stopwords, train a new LDA model, and tune the number of topics.

In [31]:
from pycaret.nlp import *

In [41]:
nlp_two = setup(data = tweet_dataset, target = 'text', session_id = 456,
                custom_stopwords = ['flight', 'thank', 'time', 'delay', 
                                    'plane', 'check', 'need', 
                                    'want', 'day', 'issue', 
                                    'never', 'send', 'pay', 'weather', 
                                    'home', 'new', 'long', 'much', 'reservation',
                                    'response'],
                log_experiment = True, experiment_name = 'kiva1'
                )

Description,Value
session_id,456
Documents,3000
Vocab Size,2977
Custom Stopwords,True
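
Custom stopwords are dropped from each document before the vocabulary is built, which is why the vocabulary size changes between the two setups. The filtering step alone can be sketched as below. This is a simplified stand-in for PyCaret's preprocessing pipeline, which does more than this, and `remove_stopwords` is a hypothetical helper name.

```python
def remove_stopwords(text, stopwords):
    """Drop every whitespace-separated token that appears in stopwords."""
    blocked = set(stopwords)  # set gives fast lookup and ignores repeated entries
    return " ".join(tok for tok in text.split() if tok not in blocked)

# First few entries of the custom stopword list passed to setup()
custom_stopwords = ["flight", "thank", "time", "delay"]

# Preprocessed tweet text taken from the assign_model output earlier
tweet = "southwestair get text flight help please"

print(remove_stopwords(tweet, custom_stopwords))  # southwestair get text help please
```

Frequent but uninformative words such as "flight" would otherwise appear in most topics, so removing them tends to make the topics easier to tell apart.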


In [42]:
model = create_model('lda')

In [43]:
plot_model(model, plot = 'topic_distribution')

In [44]:
tuned_unsupervised = tune_model(model = 'lda', multi_core = True)

Best Model: Latent Dirichlet Allocation | # Topics: 64 | Coherence: 0.5377
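
Without a `supervised_target`, `tune_model` compares candidate topic counts by coherence score and reports the best one. The selection step can be sketched as follows. Only the 64-topic score comes from the output above; the other scores are hypothetical placeholders, and `best_num_topics` is a hypothetical helper name.

```python
def best_num_topics(coherence_by_topics):
    """Return the (num_topics, coherence) pair with the highest coherence."""
    return max(coherence_by_topics.items(), key=lambda item: item[1])

scores = {
    4: 0.41,     # hypothetical placeholder
    16: 0.47,    # hypothetical placeholder
    64: 0.5377,  # reported by tune_model above
    200: 0.50,   # hypothetical placeholder
}

num_topics, coherence = best_num_topics(scores)
print(f"# Topics: {num_topics} | Coherence: {coherence}")
```

A high topic count that wins on coherence is not always the easiest to interpret, so it is worth inspecting the resulting topics before adopting it.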


In [52]:
tuned_classification = tune_model(model = 'lda', multi_core = True, supervised_target = 'retweet_count')

ValueError: ignored