In [None]:
from IPython.display import Image  # needed here so this cell runs on its own
Image("../input/giffyy/giphy.gif")

**When you watch Friends, you don't just watch it, you live it. Every character is funny in their own way.**

**This notebook is my first experiment in NLP.**

In [None]:
from IPython.display import Image
import os
!ls ../input/

In [None]:
import pandas as pd
import numpy as np
from pandas import DataFrame

In [None]:
train = pd.read_csv('../input/friends-transcript/friends_quotes.csv')

In [None]:
train

One of the most basic features we can extract is the number of words in each quote. The intuition, borrowed from sentiment analysis, is that negative texts often contain fewer words than positive ones, although this is a rule of thumb rather than a guarantee.

In [None]:
train['word_count'] = train['quote'].apply(lambda x: len(str(x).split(" ")))
train[['quote','word_count']].head()
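
The cell above splits on a single space, so consecutive or trailing spaces produce empty strings that are counted as words. As a minimal sketch (the sample string and the helper name `word_count` are illustrative and not part of the dataset), splitting on any whitespace avoids this:

```python
def word_count(text):
    """Count whitespace-separated tokens (hypothetical helper)."""
    return len(str(text).split())

sample = "How  you doin'? "           # made-up text with a double and a trailing space
naive_count = len(sample.split(" "))  # empty strings between spaces are counted too
robust_count = word_count(sample)     # split() with no argument skips empty strings

print(naive_count, robust_count)
```

Whether the difference matters depends on how clean the transcript text is; on well-formed quotes both approaches agree.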

Number of characters
This feature builds on the same intuition as the word count. Here, we calculate the number of characters in each quote by taking the length of the string.

In [None]:
train['char_count'] = train['quote'].str.len() ## this also includes spaces
train[['quote','char_count']].head()

Average Word Length
We will also extract another feature which will calculate the average word length of each quote. This can also potentially help us in improving our model.

Here, we simply take the sum of the lengths of all the words and divide it by the number of words in the quote:

In [None]:
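def avg_word(sentence):
  # Mean word length; returns 0 for empty or whitespace-only text
  words = sentence.split()
  if not words:
    return 0
  return sum(len(word) for word in words) / len(words)

# Compute the feature on the quote text itself
train['avg_word'] = train['quote'].apply(avg_word)
train[['quote','avg_word']].head()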
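
As a worked example of the average word length, here is a small sketch on a made-up sentence. The helper name `mean_word_length` is hypothetical, and returning 0.0 for empty text is an assumption chosen to avoid a division by zero.

```python
def mean_word_length(text):
    """Average token length; 0.0 for empty text (illustrative helper)."""
    words = text.split()
    if not words:
        return 0.0
    return sum(len(w) for w in words) / len(words)

quote = "we were on a break"   # word lengths: 2, 4, 2, 1, 5 (sum 14, 5 words)
print(mean_word_length(quote))
print(mean_word_length("   "))  # whitespace only, falls back to 0.0
```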

Number of stopwords
Generally, while solving an NLP problem, the first thing we do is to remove the stopwords. But sometimes calculating the number of stopwords can also give us some extra information which we might have been losing before.

Here, we have imported stopwords from NLTK, which is a basic NLP library in Python.

In [None]:
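from nltk.corpus import stopwords
stop = set(stopwords.words('english'))  # a set makes membership tests faster than a list

# The NLTK list is lower-cased and the quotes are not yet, so compare in lower case
train['stopwords'] = train['quote'].apply(lambda x: len([w for w in x.split() if w.lower() in stop]))
train[['quote','stopwords']].head()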
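
Stop-word lists are usually lower-cased, so the count depends on whether tokens are lower-cased before the lookup. The sketch below uses a tiny hand-picked stop list (an assumption for illustration only; NLTK's English list is much longer) and a hypothetical helper `count_stopwords` to show the difference:

```python
# Tiny hand-picked stop list for illustration; not the NLTK list.
toy_stop = {"i", "you", "we", "were", "on", "a", "the", "is"}

def count_stopwords(text, stop_set):
    """Count tokens whose lower-cased form is in stop_set."""
    return sum(1 for w in text.split() if w.lower() in stop_set)

quote = "We were on a break"  # made-up sample
case_sensitive = sum(1 for w in quote.split() if w in toy_stop)  # misses "We"
case_insensitive = count_stopwords(quote, toy_stop)

print(case_sensitive, case_insensitive)
```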

Number of special characters
One more feature we can extract is the number of hashtags or mentions in the text. This is mainly useful for social-media data; TV dialogue is unlikely to contain any, so expect mostly zeros here.

Here, we make use of the `startswith` method because hashtags (or mentions) always appear at the beginning of a word.

In [None]:
train['hashtags'] = train['quote'].apply(lambda x: len([x for x in x.split() if x.startswith('#')]))
train[['quote','hashtags']].head()

Number of numerics
Just like we calculated the number of words, we can also count the numeric tokens present in each quote. It does not have a lot of use in our example, but it is still a useful feature to compute in similar exercises.

In [None]:
train['numerics'] = train['quote'].apply(lambda x: len([x for x in x.split() if x.isdigit()]))
train[['quote','numerics']].head()
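
Note that `str.isdigit()` only accepts tokens made up entirely of digits, so decimals and numbers with attached punctuation are not counted. The sketch below compares it with a regular expression on a made-up sentence; the pattern is an assumption about what should count as a number, not a definitive rule.

```python
import re

quote = "I have 3 sisters and 2.5 cats since 1994,"  # made-up sample
tokens = quote.split()

# Only the token "3" consists purely of digits.
isdigit_count = sum(1 for t in tokens if t.isdigit())

# Digits with an optional decimal part, found anywhere in the text.
regex_count = len(re.findall(r"\d+(?:\.\d+)?", quote))

print(isdigit_count, regex_count)
```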

Number of Uppercase words
Anger or rage is quite often expressed by writing words in UPPERCASE, which makes it worthwhile to count those words.

In [None]:
train['upper'] = train['quote'].apply(lambda x: len([x for x in x.split() if x.isupper()]))
train[['quote','upper']].head()
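
One caveat is that `str.isupper()` is also true for single capital letters such as the pronoun "I", which can inflate the count. A minimal sketch on a made-up sentence, where skipping one-letter tokens is an assumption about what counts as shouting:

```python
quote = "OH MY GOD! I know"  # made-up sample
tokens = quote.split()

# Every token whose cased characters are all upper case, including "I".
all_upper = [t for t in tokens if t.isupper()]

# Same test, but ignoring one-letter tokens.
shouted = [t for t in tokens if t.isupper() and len(t) > 1]

print(len(all_upper), len(shouted))
```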

Lower case
The first pre-processing step is to transform our quotes into lower case. This avoids having multiple copies of the same word. For example, when counting words, ‘Analytics’ and ‘analytics’ would otherwise be treated as different words.

In [None]:
train['quote'] = train['quote'].apply(lambda x: " ".join(x.lower() for x in x.split()))
train['quote'].head()

Removing Punctuation
The next step is to remove punctuation, as it usually adds little information when treating text data. Removing all instances of it also helps us reduce the size of the training data.

In [None]:
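# regex=True is required in pandas 2.0+, where str.replace treats the pattern as a literal by default
train['quote'] = train['quote'].str.replace(r'[^\w\s]', '', regex=True)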
train['quote'].head()
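
The pattern `[^\w\s]` matches every character that is neither a word character nor whitespace. As a small sketch of what that does to a single string (the helper name `strip_punctuation` and the samples are illustrative):

```python
import re

def strip_punctuation(text):
    """Remove every character that is not a word character or whitespace."""
    return re.sub(r"[^\w\s]", "", text)

print(strip_punctuation("we were on a break!"))
print(strip_punctuation("don't"))  # the apostrophe is removed as well
```

Contractions lose their apostrophe, so "don't" becomes "dont", which no longer matches an entry such as "don't" in a stop-word list.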

Removal of Stop Words
As we discussed earlier, stop words (or commonly occurring words) should be removed from the text data. For this purpose, we can either create a list of stopwords ourselves or we can use predefined libraries.

In [None]:
from nltk.corpus import stopwords
stop = stopwords.words('english')
train['quote'] = train['quote'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
train['quote'].head()

Common word removal
Previously, we removed words that are common in English in general. We can also remove the words that occur most often in our own text data. First, let’s check the 10 most frequently occurring words, then decide whether to remove or retain them.

In [None]:
freq = pd.Series(' '.join(train['quote']).split()).value_counts()[:10]
freq

Now, let’s remove these words, as their presence will not be of any use in classifying our text data.

In [None]:
freq = list(freq.index)
train['quote'] = train['quote'].apply(lambda x: " ".join(x for x in x.split() if x not in freq))
train['quote'].head()
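
The same idea can be sketched without pandas using `collections.Counter`. The toy quotes below are made up, and removing only the single most frequent word is an arbitrary choice for the example:

```python
from collections import Counter

quotes = ["pivot pivot pivot", "pivot the couch", "we were on a break"]  # toy data

# Count every token across all quotes.
counts = Counter(word for q in quotes for word in q.split())

# Keep the most frequent word(s) in a set for fast membership tests.
top_words = {w for w, _ in counts.most_common(1)}

cleaned = [" ".join(w for w in q.split() if w not in top_words) for q in quotes]
print(top_words, cleaned)
```

A quote made up entirely of removed words becomes an empty string, which is worth keeping in mind for later steps that index into the text.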

Rare words removal
Similarly, just as we removed the most common words, this time let’s remove rarely occurring words from the text. Because they’re so rare, the association between them and other words is dominated by noise. Alternatively, you can replace rare words with a more general form, which will then have a higher count.

In [None]:
freq = pd.Series(' '.join(train['quote']).split()).value_counts()[-10:]
freq

In [None]:
freq = list(freq.index)
train['quote'] = train['quote'].apply(lambda x: " ".join(x for x in x.split() if x not in freq))
train['quote'].head()
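
Taking the last 10 entries of `value_counts()` removes only 10 of the rarest words, and a corpus typically has many more words with the same lowest count. An alternative, sketched here on made-up data, is to drop every word below a count threshold; `min_count` is a hypothetical parameter and 2 is an arbitrary value.

```python
from collections import Counter

quotes = ["joey joey food", "joey food sandwich", "smelly cat"]  # toy data
min_count = 2  # hypothetical threshold: drop words seen fewer than 2 times

counts = Counter(w for q in quotes for w in q.split())
rare = {w for w, c in counts.items() if c < min_count}

cleaned = [" ".join(w for w in q.split() if w not in rare) for q in quotes]
print(sorted(rare), cleaned)
```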

Spelling check

In [None]:
from textblob import TextBlob
train['quote'][:5].apply(lambda x: str(TextBlob(x).correct()))

In [None]:
TextBlob(train['quote'][1]).words

Stemming

In [None]:
from nltk.stem import PorterStemmer
st = PorterStemmer()
train['quote'][:5].apply(lambda x: " ".join([st.stem(word) for word in x.split()]))

Lemmatization

In [None]:
from textblob import Word
train['quote'] = train['quote'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))
train['quote'].head()

N-grams
N-grams are combinations of multiple words used together. N-grams with N=1 are called unigrams. Similarly, bigrams (N=2), trigrams (N=3) and so on can also be used.

In [None]:
TextBlob(train['quote'][0]).ngrams(3)
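
To make the idea concrete, here is a minimal pure-Python sketch of n-gram extraction with a sliding window. The function name `ngrams` and the sample sentence are illustrative; this is not TextBlob's implementation.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "we were on a break".split()  # made-up sample
trigrams = ngrams(tokens, 3)
print(trigrams)
```

A text with fewer than n tokens yields an empty list.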

Term frequency
Term frequency is simply the ratio of the count of a word present in a sentence, to the length of the sentence.

Therefore, we can generalize term frequency as:

TF = (Number of times term T appears in the particular row) / (number of terms in that row)

In [None]:
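# normalize=True divides each count by the number of terms in the row, matching the TF formula above
tf1 = (train['quote'][1:2]).apply(lambda x: pd.Series(x.split(" ")).value_counts(normalize=True)).sum(axis = 0).reset_index()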
tf1.columns = ['words','tf']
tf1
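
The TF formula can also be sketched in plain Python for a single text. The helper name `term_frequency` and the sample string are illustrative:

```python
from collections import Counter

def term_frequency(text):
    """Map each term to (occurrences in text) / (number of terms in text)."""
    tokens = text.split()
    counts = Counter(tokens)
    return {w: c / len(tokens) for w, c in counts.items()}

tf = term_frequency("pivot pivot pivot couch")  # made-up sample
print(tf)
```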

Inverse Document Frequency
The intuition behind inverse document frequency (IDF) is that a word is not of much use to us if it’s appearing in all the documents.

In [None]:
for i,word in enumerate(tf1['words']):
  tf1.loc[i, 'idf'] = np.log(train.shape[0]/(len(train[train['quote'].str.contains(word)])))

tf1
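
The cell above computes IDF as log(N / n), where N is the number of rows and n is the number of rows containing the word. Note that `str.contains` matches substrings, so a short word such as "a" is also found inside "cat". The sketch below, on made-up documents, counts whole tokens instead; returning 0.0 for a word that appears nowhere is an assumption made to avoid dividing by zero.

```python
import math

docs = ["smelly cat", "cat food", "we were on a break"]  # toy documents

def idf(word, documents):
    """log(N / n) with n counted over whole tokens; 0.0 if the word never occurs."""
    n = sum(1 for d in documents if word in d.split())
    return math.log(len(documents) / n) if n else 0.0

substring_df = sum(1 for d in docs if "a" in d)      # substring match: all 3 documents
token_df = sum(1 for d in docs if "a" in d.split())  # whole-token match: 1 document

print(substring_df, token_df, idf("cat", docs), idf("break", docs))
```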

Term Frequency – Inverse Document Frequency (TF-IDF)
TF-IDF is the multiplication of the TF and IDF which we calculated above.

In [None]:
tf1['tfidf'] = tf1['tf'] * tf1['idf']
tf1

In general, TF-IDF penalizes words that occur in many quotes, because their IDF is low, and gives a higher weight to words that occur in only a few quotes, which tend to be more useful for telling quotes apart.

We don’t have to calculate TF and IDF every time beforehand and then multiply them to obtain TF-IDF. Instead, sklearn has a separate function to directly obtain it:

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(max_features=1000, lowercase=True, analyzer='word',
 stop_words= 'english',ngram_range=(1,1))
train_vect = tfidf.fit_transform(train['quote'])

train_vect

Bag of Words
Bag of Words (BoW) refers to a representation of text that describes the presence of words within the text data. The intuition behind this is that two similar text fields will contain similar kinds of words, and will therefore have a similar bag of words. Further, from the text alone we can learn something about the meaning of the document.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
bow = CountVectorizer(max_features=1000, lowercase=True, ngram_range=(1,1),analyzer = "word")
train_bow = bow.fit_transform(train['quote'])
train_bow
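
As a minimal sketch of what a bag-of-words matrix contains, the following builds one by hand for two made-up documents. It illustrates the representation only and does not reproduce `CountVectorizer`'s tokenization or options.

```python
from collections import Counter

docs = ["smelly cat smelly cat", "cat food"]  # toy documents

# Vocabulary: every distinct token, sorted so the column order is stable.
vocab = sorted({w for d in docs for w in d.split()})

# One row per document, one column per vocabulary word, holding raw counts.
bow_matrix = [[Counter(d.split())[w] for w in vocab] for d in docs]

print(vocab)
print(bow_matrix)
```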

Sentiment Analysis
Finally, let’s look at the sentiment of the quotes. Before applying any ML/DL models (which could use the sentiment from the textblob library as a separate feature), let’s check the sentiment of the first few quotes.

In [None]:
train['quote'][:5].apply(lambda x: TextBlob(x).sentiment)

Above, you can see that it returns a tuple representing the polarity and subjectivity of each quote. Here, we only extract polarity, as it indicates the sentiment: values nearer to 1 mean a positive sentiment and values nearer to -1 mean a negative sentiment. This can also work as a feature for building a machine learning model.

In [None]:
train['sentiment'] = train['quote'].apply(lambda x: TextBlob(x).sentiment[0] )
train[['quote','sentiment']].head()

In [None]:
import matplotlib.pyplot as plt
Sentiment_count=train.groupby('sentiment').count()
plt.bar(Sentiment_count.index.values, Sentiment_count['quote'])
plt.xlabel('Sentiment polarity')
plt.ylabel('Number of quotes')
plt.show()
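
Grouping by the raw polarity value produces one bar per distinct float, which can be hard to read. One option, sketched here, is to bin polarity into three labels before counting. The helper `polarity_label`, the threshold `eps=0.05` and the sample values are all assumptions for illustration.

```python
from collections import Counter

def polarity_label(p, eps=0.05):
    """Map a polarity in [-1, 1] to a coarse label; eps is an arbitrary cutoff."""
    if p > eps:
        return "positive"
    if p < -eps:
        return "negative"
    return "neutral"

polarities = [0.8, -0.4, 0.0, 0.02, -0.6]  # made-up polarity scores
label_counts = Counter(polarity_label(p) for p in polarities)
print(label_counts)
```

The resulting counts can be passed to `plt.bar` in the same way as the grouped counts above.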

In [None]:
#WordCloud
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS
plt.rcParams['font.size']= 15              
plt.rcParams['savefig.dpi']= 100         
plt.rcParams['figure.subplot.bottom']= .1

In [None]:
plt.figure(figsize=(15,15))
stopwords = set(STOPWORDS)

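# Join all quotes into one string; str(train['quote']) would only pass the truncated Series preview
text = ' '.join(train['quote'].astype(str))
wordcloud = WordCloud(background_color='black', stopwords=stopwords, max_words=2000, max_font_size=80,
                      random_state=420).generate(text)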
print(wordcloud)
fig = plt.figure(1)
plt.imshow(wordcloud)
plt.title("Friends [WordCloud]")
plt.axis('off')
plt.show()

In [None]:
plt.figure(figsize=(15,15))
stopwords = set(STOPWORDS)

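# Join all quotes into one string; str(train['quote']) would only pass the truncated Series preview
text = ' '.join(train['quote'].astype(str))
wordcloud = WordCloud(background_color='black', stopwords=stopwords, max_words=2000, max_font_size=100,
                      random_state=420).generate(text)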
print(wordcloud)
fig = plt.figure(1)
plt.imshow(wordcloud)
plt.title("Friends <3 [WordCloud]")
plt.axis('off')
plt.show()