As I am new to NLP, this will be a simple classification using only the 'text' and 'airline_sentiment' columns to get my feet wet =)

In [None]:
import pandas as pd

In [None]:
tweets = pd.read_csv("../input/Tweets.csv")

In [None]:
tweets.head()

In [None]:
# extract only the text and airline_sentiment columns
df = tweets[['text','airline_sentiment']]

In [None]:
df.isnull().sum()

In [None]:
df.head()

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

In [None]:
X = df['text']
y = df['airline_sentiment']

In [None]:
# encode sentiment categories as integers
le = LabelEncoder()
y = le.fit_transform(y)  # keep the result; fit_transform does not change y in place
y

In [None]:
df['airline_sentiment'].head()
# LabelEncoder assigns codes in alphabetical order: negative = 0, neutral = 1, positive = 2
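
A quick way to double-check that mapping is to look at the fitted encoder itself. This is just a minimal sketch on a made-up list of labels (`sample_labels` and `le_demo` are names I invented for the example, not part of the notebook above): `LabelEncoder` sorts the class names and numbers them from 0.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for df['airline_sentiment'] (made up for illustration)
sample_labels = ['neutral', 'positive', 'negative', 'negative']

le_demo = LabelEncoder()
encoded = le_demo.fit_transform(sample_labels)

# classes_ is sorted, and the position of each class is its integer code
mapping = {label: int(code) for code, label in enumerate(le_demo.classes_)}
print(mapping)
print(encoded.tolist())
```

The same dictionary comprehension should work on the real `le` once it has been fitted.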

## Naive attempt

Attempt classification without cleaning the text (X) first.

In [None]:
naive_pipe = Pipeline([
    ('cv', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('lm', LogisticRegression()),
])

scores = cross_val_score(naive_pipe, X, y, cv=5)
print('Mean score: ', scores.mean())

As we can see, the mean 5-fold cross-validation accuracy is pretty decent even without cleaning the X data.
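
To judge what "decent" means, it can help to compare against a classifier that always predicts the most frequent class. Below is a rough sketch of that idea on a handful of made-up tweets (`toy_X`, `toy_y` and `baseline_pipe` are hypothetical names, and the texts are invented); it is not a result from the real data set.

```python
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Made-up tweets, imbalanced on purpose: 6 negative, 2 neutral, 2 positive
toy_X = pd.Series([
    'flight delayed again', 'lost my bag', 'rude staff at gate',
    'cancelled with no notice', 'worst trip ever', 'still waiting on hold',
    'what time is boarding', 'is there wifi on board',
    'great crew today', 'thanks for the upgrade',
])
toy_y = pd.Series(['negative'] * 6 + ['neutral'] * 2 + ['positive'] * 2)

# Same vectorizer step as the naive pipeline, but the final step ignores
# the features and always predicts the majority class of the training fold
baseline_pipe = Pipeline([
    ('cv', CountVectorizer()),
    ('majority', DummyClassifier(strategy='most_frequent')),
])

baseline_scores = cross_val_score(baseline_pipe, toy_X, toy_y, cv=2)
print('Baseline mean score: ', baseline_scores.mean())
```

Swapping in the real `X` and `y` (with `cv = 5`) would give the number the logistic regression has to beat.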

## 2nd attempt with data cleaning

We're going to remove stopwords and any unnecessary punctuation. Hopefully this will produce better results!

In [None]:
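# word_tokenize and stopwords rely on NLTK data packages,
# which need to be downloaded once with nltk.download() if they are missing
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string
from sklearn.base import BaseEstimator, TransformerMixin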

In [None]:
class PunctuationRemover(BaseEstimator,TransformerMixin):
    def fit(self, column, y = None):
        return self
    
    def removePunctuation(self, text, punctuation):
        # keep every token that is not a single punctuation character
        clean_words = [token for token in word_tokenize(text)
                       if token not in punctuation]
        return ' '.join(clean_words)
    
    def transform(self, column, y = None):
        punctuation = set(string.punctuation)
        return column.apply(lambda x: self.removePunctuation(x,punctuation))

In [None]:
class StopwordRemover(BaseEstimator,TransformerMixin):
    def fit(self, column, y = None):
        return self
    
    def removeStopwords(self, text, stop_words):
        # lowercase, split on whitespace, and drop every stopword
        clean_words = [word for word in text.lower().split()
                       if word not in stop_words]
        return ' '.join(clean_words)
    
    def transform(self, column, y = None):
        stop_words = set(stopwords.words('english'))
        return column.apply(lambda x: self.removeStopwords(x,stop_words))
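
To get a feel for what this transformer does to a single tweet, here is a small standalone sketch of the same lowercase-split-filter logic. The function name `remove_stopwords` and the tiny `toy_stop_words` set are my own stand-ins so the example runs without NLTK; the real English list is much longer, and it also contains 'not'.

```python
# Tiny hand-picked stopword set: an assumption for illustration only,
# standing in for set(stopwords.words('english'))
toy_stop_words = {'the', 'was', 'not', 'my', 'and'}

def remove_stopwords(text, stop_words):
    """Lowercase, split on whitespace, and drop any token found in stop_words."""
    return ' '.join(w for w in text.lower().split() if w not in stop_words)

with_negation = remove_stopwords('The flight was not good', toy_stop_words)
without_negation = remove_stopwords('The flight was good', toy_stop_words)

print(with_negation)
print(without_negation)
```

Both sentences come out the same once 'not' is filtered, so this kind of cleaning can throw away information that matters for sentiment.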

In [None]:
# visualize the newly-transformed text
visualization_pipe = Pipeline([
    ('sw', StopwordRemover()),
    ('punc', PunctuationRemover()),
])

pd.DataFrame(visualization_pipe.fit_transform(X)).head()

In [None]:
pipe2 = Pipeline([
    ('sw', StopwordRemover()),
    ('punc', PunctuationRemover()),
    ('cv', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('lm', LogisticRegression()),
])

scores = cross_val_score(pipe2, X, y, cv=5)
print('Mean score: ', scores.mean())

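Interesting! Our accuracy dropped after cleaning the data. Upon further examination below, it seems like clearing out the punctuation could have removed some important features of the text, such as smileys and exclamation marks, which could have been helpful in determining the sentiment toward the airlines. One caveat: CountVectorizer's default tokenizer only keeps words of two or more characters, so the naive pipeline did not see these symbols as features either, and removing stopwords such as 'not' may have contributed more to the drop. Still, as seen below, smileys appear to be a common way to express approval (or disapproval).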

In [None]:
smileys = ['=(', '=)', ':)', ':(']

for text in df['text']:
    # print each tweet once, even if it contains several smileys
    if any(smiley in text for smiley in smileys):
        print(text)
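
If smileys do carry signal, the vectorizer has to be told to keep them. The sketch below compares `CountVectorizer`'s default analyzer with one using a custom `token_pattern`; the pattern is my own assumption (the default word pattern plus the four smileys searched for above), and the sample sentence is made up.

```python
from sklearn.feature_extraction.text import CountVectorizer

sample = 'I love this flight :) but the delay =('

# Default analyzer: lowercases and keeps only runs of 2+ word characters
default_tokens = CountVectorizer().build_analyzer()(sample)

# Assumed pattern: the default word pattern, plus the smileys =( =) :) :(
smiley_pattern = r'(?u)\b\w\w+\b|[:=][()]'
smiley_tokens = CountVectorizer(token_pattern=smiley_pattern).build_analyzer()(sample)

print(default_tokens)
print(smiley_tokens)
```

Using a vectorizer like this as the 'cv' step of the naive pipeline would be one way to test whether smileys actually help, though I haven't tried it on the full data set here.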
    

Further examination can be done on the data set for sure (like above), but I'll stop here for now. Hope you guys enjoyed reading through my first time with NLP!