### NLP Sentiment Analysis Exercise

In [1]:
import numpy as np 
import pandas as pd 
import re
import nltk 
import string

In [2]:
# load data
data_source_url = "https://raw.githubusercontent.com/kolaveridi/kaggle-Twitter-US-Airline-Sentiment-/master/Tweets.csv"
airline_tweets = pd.read_csv(data_source_url)

**Task:** Print the top 5 rows.

In [3]:
airline_tweets.head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)


**Task:** Use the `'text'` column to create an array with the name `'features'`.



In [4]:
features = airline_tweets['text']

In [5]:
features[:5]

0                  @VirginAmerica What @dhepburn said.
1    @VirginAmerica plus you've added commercials t...
2    @VirginAmerica I didn't today... Must mean I n...
3    @VirginAmerica it's really aggressive to blast...
4    @VirginAmerica and it's a really big bad thing...
Name: text, dtype: object

**Task:** Use the `'airline_sentiment'` column to create an array with the name `'labels'`.

In [6]:
labels = airline_tweets['airline_sentiment']

In [7]:
labels[:5]

0     neutral
1    positive
2     neutral
3    negative
4    negative
Name: airline_sentiment, dtype: object

**Task:** Clean the text data in the `'features'` array.

    - Remove all special characters.
    - Remove all single characters.
    - Remove single characters from the start of the text.
    - Substitute multiple spaces with a single space.
    - Convert all text to lowercase.

In [8]:
processed_features = []

for sentence in features:
    # Tokenize, then keep only purely alphabetic tokens; this drops
    # punctuation, digits, and mixed tokens such as "n't".
    # word_tokenize needs NLTK's Punkt data: nltk.download('punkt')
    # ('punkt_tab' on newer NLTK versions).
    tokens = nltk.tokenize.word_tokenize(sentence)
    words = [word for word in tokens if word.isalpha()]

    # Remove all single characters, including any at the start
    words = [word for word in words if len(word) > 1]

    # Convert to lowercase
    words = [word.lower() for word in words]

    # Join the words back into a sentence; joining with one space
    # also replaces any run of whitespace with a single space
    joined = ' '.join(words)

    # Store the cleaned sentence
    processed_features.append(joined)
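
The cell above cleans the text through tokenization. Since `re` is imported but unused, a regex-based variant that follows the task's steps one by one may also be what the exercise has in mind. The sketch below is one possible version, not the notebook's method; the function name `clean_tweet` and the example string are made up for illustration. Note that `\W` keeps digits and underscores, so the result can differ slightly from the tokenizer-based cell.

```python
import re

def clean_tweet(text):
    # Replace every non-word character (punctuation, symbols) with a space
    text = re.sub(r"\W", " ", text)
    # Remove single letters that are surrounded by whitespace
    text = re.sub(r"\s+[a-zA-Z]\s+", " ", text)
    # Remove a single letter at the very start of the text
    text = re.sub(r"^[a-zA-Z]\s+", " ", text)
    # Collapse runs of whitespace into one space
    text = re.sub(r"\s+", " ", text)
    # Lowercase and trim leading/trailing spaces
    return text.lower().strip()

example = "@VirginAmerica it's   really aggressive!"
print(clean_tweet(example))
```

To clean the whole column this way, something like `[clean_tweet(s) for s in features]` should work.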

**Task:** Import stopwords from nltk.

In [9]:
from nltk.corpus import stopwords

**Task:** Import TfidfVectorizer from sklearn.

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer

**Task:** Instantiate TfidfVectorizer with the following parameters:

    - max_features = 2500
    - min_df = 7
    - max_df = 0.8
    - stop_words = stopwords.words('english')
    
    


In [11]:
stop_words = stopwords.words('english')
tfidf_vec = TfidfVectorizer(max_features=2500, min_df=7, max_df=0.8, stop_words=stop_words)

**Bonus:** How would you determine optimal parameters for TfidfVectorizer? Discuss with your peers and/or mentors. Write down your answer below.
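
One common answer is a cross-validated grid search over a pipeline that contains both the vectorizer and the classifier, so the vectorizer is refit on each training fold. The sketch below shows the idea on a tiny made-up corpus; the texts, labels, and grid values are assumptions chosen only so the example runs quickly. On the real tweets one would likely also search over `min_df` and `max_df`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy corpus standing in for the cleaned tweets and labels (hypothetical data)
texts = ["great flight", "love the crew", "nice seats", "great service",
         "bad delay", "lost my bag", "terrible service", "bad food"]
y = ["positive"] * 4 + ["negative"] * 4

# Vectorizer and classifier are tuned together as one estimator
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB()),
])

# Candidate values; "tfidf__" routes each parameter to the vectorizer step
param_grid = {
    "tfidf__max_features": [10, None],
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "tfidf__sublinear_tf": [False, True],
}

# 2-fold cross-validation because the toy corpus is tiny
search = GridSearchCV(pipe, param_grid, cv=2, scoring="accuracy")
search.fit(texts, y)

print(search.best_params_)
print(search.best_score_)
```

Because the search scores each combination on held-out folds, the chosen parameters reflect generalization rather than fit to the training data. Accuracy may not be the best scoring choice when classes are imbalanced, as they are in this dataset.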

**Task:** Transform the features with the vectorizer.

In [16]:
X = tfidf_vec.fit_transform(processed_features)

**Task:** Import train_test_split from sklearn and split the data.

In [17]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2)
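
The split above is random, so the metrics change from run to run. A hedged variant, shown here on made-up toy arrays rather than the tweet data, fixes the seed with `random_state` and uses `stratify` to keep class proportions the same in both parts, which may help when one class dominates.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for X and labels (hypothetical data): 10 rows, imbalanced classes
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array(["negative"] * 6 + ["neutral"] * 2 + ["positive"] * 2)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.5,
    random_state=42,  # fixed seed: the same split on every run
    stratify=y_toy,   # preserve class proportions in train and test
)

# Class counts in the test part mirror the 6:2:2 ratio of the full data
print(np.unique(y_te, return_counts=True))
```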

**Task:** Import any classifier of your choice from sklearn (e.g. Random Forest, LogReg, Naive Bayes).

In [14]:
from sklearn.naive_bayes import MultinomialNB

**Task:** Fit your classifier to the training data.

In [18]:
nb = MultinomialNB()
model = nb.fit(X_train, y_train)

**Task:** Predict labels for `X_test`.

In [20]:
y_pred = model.predict(X_test)

**Task:** Import confusion matrix and accuracy_score.

In [21]:
from sklearn import metrics

**Task:** Print confusion matrix.

In [22]:
metrics.confusion_matrix(y_test, y_pred)

array([[1765,   45,    7],
       [ 403,  197,   24],
       [ 248,   48,  191]], dtype=int64)

In [24]:
model.classes_

array(['negative', 'neutral', 'positive'], dtype='<U8')

**Task:** Print the accuracy score.

In [23]:
metrics.accuracy_score(y_test, y_pred)

0.7353142076502732
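
The overall accuracy hides how differently the classes behave. As a rough check, per-class recall and precision can be derived directly from the confusion matrix printed earlier; the sketch below copies that matrix and the `model.classes_` order by hand and uses only NumPy.

```python
import numpy as np

# Confusion matrix and class order copied from the outputs above
cm = np.array([[1765, 45, 7],
               [403, 197, 24],
               [248, 48, 191]])
classes = ["negative", "neutral", "positive"]

# Rows are true labels, columns are predicted labels
recall = cm.diagonal() / cm.sum(axis=1)     # correct / all true members of the class
precision = cm.diagonal() / cm.sum(axis=0)  # correct / all predictions of the class
accuracy = cm.diagonal().sum() / cm.sum()

for name, r, p in zip(classes, recall, precision):
    print(f"{name}: recall={r:.3f}, precision={p:.3f}")
print(f"accuracy={accuracy:.4f}")
```

Recall comes out near 0.97 for negative tweets but only about 0.32 for neutral and 0.39 for positive ones, so most of the errors are neutral or positive tweets predicted as negative.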